RECONFIGURABLE PROCESSING

Field of the Invention 

This invention relates to the accomplishment of moderately complex computer 
applications by a combination of hardware and software, and more particularly to methods of 
optimizing the implementation of portions of such computer applications in hardware,
hardware thus produced, and to the resultant combination of hardware and software. 

Cross-Reference to Related Applications 

This application claims priority from provisional patent application Serial No.
60/445,339 filed February 5, 2003 in the name of Aravind R. Dasu et al. entitled
"Reconfigurable Processing," provisional patent application Serial No. 60/490,162 filed July
24, 2003 in the name of Aravind R. Dasu et al. entitled "Algorithm Design for Zone Pattern
Matching to Generate Cluster Modules and Control Data Flow Based Task Scheduling of the
Modules," provisional patent application Serial No. 60/493,132 filed August 6, 2003 in the
name of Aravind R. Dasu et al. entitled "Heterogeneous Hierarchical Routing Architecture,"
and provisional patent application Serial No. 60/523,462 filed November 18, 2003 in the
name of Aravind R. Dasu et al. entitled "Methodology to Design a Dynamically
Reconfigurable Processor," all of which are incorporated herein by reference.

Background 

A number of techniques have been proposed for improving the speed and cost of
moderately complex computer program applications. By moderately complex computer
programming is meant programming of about the same general level of complexity as
multimedia processing.

Multimedia processing is becoming increasingly important, with a wide variety of
applications ranging from multimedia cell phones to high definition interactive television.
Media processing involves the capture, storage, manipulation and transmission of
multimedia objects such as text, handwritten data, audio objects, still images, 2D/3D
graphics, animation and full-motion video. A number of implementation strategies have
been proposed for processing multimedia data. These approaches can be broadly classified
based on the evolution of processing architectures and the functionality of the processors.
In order to provide media processing solutions to different consumer markets, designers
have combined some of the classical features from both the functional and evolution based
classifications, resulting in many hybrid solutions.

Multimedia and graphics applications are computationally intensive and have
traditionally been handled in three different ways. One is through the use of a high speed
general purpose processor with accelerator support, which is essentially a sequential
machine with an enhanced instruction set architecture. Here the overlaying software
bears the burden of interpreting the application in terms of the limited tasks that the
processor can execute (instructions) and of scheduling these instructions to avoid resource
and data dependencies. The second is through the use of an Application Specific
Integrated Circuit (ASIC), which is a completely hardware oriented approach, spatially
exploiting parallelism to the maximum extent possible. The former, although slower,
offers the benefit of hardware reuse for executing other applications. The latter, albeit
faster and more power, area and time efficient for a specific application, offers poor
hardware reutilization for other applications. The third is through specialized
programmable processors such as DSPs and media processors. These attempt to
incorporate the programmability of general purpose processors and provide some
amount of spatial parallelism in their hardware architectures.

The complexity, variety of techniques and tools, and the high computation, storage
and I/O bandwidths associated with multimedia processing present opportunities for
reconfigurable processing to enable features such as scalability, maximal resource utilization
and real-time implementation. The relatively new domain of reconfigurable solutions lies in
the region of computing space that offers the advantages of these approaches while
minimizing their drawbacks. Field Programmable Gate Arrays (FPGAs) were the first
attempts in this direction. But poor on-chip network architectures lead to high reconfiguration
times and power consumption. Improvements over this design using Hierarchical Network
architectures with RAM style configuration loading have led to a factor of two to four times
reduction in individual configuration loading times. But the amount of redundant and
repetitive configurations still remains high. This is one of the important factors that lead to
the large overall configuration times and high power consumption compared to ASIC or
embedded processor solutions.

A variety of media processing techniques are typically used in multimedia
processing environments to capture, store, manipulate and transmit multimedia objects such
as text, handwritten data, audio objects, still images, 2D/3D graphics, animation and
full-motion video. Example techniques include speech analysis and synthesis, character
recognition, audio compression, graphics animation, 3D rendering, image enhancement and
restoration, image/video analysis and editing, and video transmission. Multimedia
computing presents challenges from the perspectives of both hardware and software. For
example, multimedia standards such as MPEG-1, MPEG-2, MPEG-4, MPEG-7, H.263 and
JPEG 2000 involve execution of complex media processing tasks in real-time. The need for
real-time processing of complex algorithms is further accentuated by the increasing interest
in 3-D image and stereoscopic video processing. Each medium in a multimedia environment
requires different processes, techniques, algorithms and hardware. The complexity, variety
of techniques and tools, and the high computation, storage and I/O bandwidths associated
with processing at this level of complexity present opportunities for reconfigurable
processing to enable features such as scalability, maximal resource utilization and
real-time implementation.

To demonstrate the potential for reconfiguration in multimedia computations, the
inventors have performed a detailed complexity analysis of the recent multimedia standard
MPEG-4. The results show that there are significant variations in the computational
complexity among the various modes/operations of MPEG-4. This points to the potential for
extensive opportunities for exploiting reconfigurable implementations of multimedia/
graphics algorithms.

The availability of large, fast FPGAs (field programmable gate arrays) is
making possible reconfigurable implementations for a variety of applications. FPGAs
consist of arrays of Configurable Logic Blocks (CLBs) that implement various logical
functions. The latest FPGAs from vendors like Xilinx and Altera can be partially
configured and run at several megahertz. Ultimately, computing devices may be able to
adapt the underlying hardware dynamically in response to changes in the input data or
processing environment and process real time applications. Thus FPGAs have
established a point in the computing space which lies in between the dominant extremes
of computing, ASICs and software programmable/instruction set based architectures.
There are three dominant features that differentiate reconfigurable architectures from
instruction set based programmable computing architectures and ASICs: (i) spatial
implementation of instructions through a network of processing elements with the
absence of an explicit instruction fetch-decode model; (ii) flexible interconnects which
support task dependent data flow between operations; and (iii) the ability to change the
arithmetic and logic functionality of the processing elements. The reprogrammable
space is characterized by the allocation and structure of these resources. Computational
tasks can be implemented on a reconfigurable device with intermediate data flowing
from the generating function to the receiving function. The salient features of
reconfigurable machines are:

• Instructions are implemented through locally configured processing elements,
thus allowing the reconfigurable device to effectively process more instructions into
active silicon in each cycle.

• Intermediate values are routed in parallel from producing functions to
consuming functions (as space permits) rather than forcing all communication to take
place through a central resource bottleneck.

• Memory and interconnect resources are distributed and are deployed based on
need rather than being centralized, hence presenting opportunities to extract parallelism
at various levels.

The networks connecting the Configurable Logic Blocks or Units (CLBs) or
processing elements can range from full connectivity crossbar to neighbor only connecting
mesh networks. The best characterization to date which empirically measures the growth in
the interconnection requirements with respect to the number of Look-Up Tables (LUTs) is
Rent's rule, which is given as follows:

$$N_{io} = C \, N_{gates}^{p}$$

where $N_{io}$ corresponds to the number of interconnections (in/out lines) in a region
containing $N_{gates}$ gates. $C$ and $p$ are empirical constants. For logical functions,
$p$ typically lies in the range $0.5 < p < 0.7$.
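
As a rough illustration of how Rent's rule scales, the following sketch evaluates the rule for a few region sizes at both ends of the stated range of p. The function name and the value C = 4.0 are assumptions chosen only for this example; they do not come from the specification.

```python
def rent_io_count(n_gates: int, c: float, p: float) -> float:
    """Estimate the number of in/out lines for a region of n_gates gates
    using Rent's rule: N_io = C * N_gates ** p."""
    if n_gates <= 0:
        raise ValueError("n_gates must be positive")
    return c * n_gates ** p


# C = 4.0 is an assumed, illustrative constant, not a measured value.
for n in (16, 256, 4096):
    low = rent_io_count(n, c=4.0, p=0.5)
    high = rent_io_count(n, c=4.0, p=0.7)
    print(f"N_gates={n:5d}  N_io between {low:8.1f} and {high:8.1f}")
```

Because p is less than one, the number of in/out lines grows more slowly than the number of gates, which is the property a hierarchical interconnect is meant to exploit.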

It has been shown [1] (by building the FPGA based on Rent's model and using a
hierarchical approach) that the configuration instruction sizes in traditional FPGAs are higher
than necessary, by at least a factor of two to four. Therefore, for rapid configuration, off-chip
context loading becomes slow due to the large amount of configuration data that must be
transferred across a limited bandwidth I/O path. It is also shown that greater word widths
increase wiring requirements, while decreasing switching requirements. In addition, larger
granularity data paths can be used to reduce instruction overheads. The utility of this
optimization largely depends on the granularity of the data which needs to be processed.
However, if the architectural granularity is larger than the task granularity, the device's
computational power will be underutilized. Another promising development in efforts to
reduce configuration time is shown in [2].

Most of the current approaches towards building a reconfigurable processor are
targeted towards performance in terms of speed and are not tuned for power awareness or
configuration time optimization. Therefore certain problems have surfaced that need to be
addressed at the pre-processing phase.

First, the granularity or the processing ability of the Configurable Logic Units (CLUs)
must be driven by the set of applications that are intended to be ported onto the processing
platform. Some research groups have taken the approach of visual inspection [3], while
others have adopted algorithms of exponential complexity [4,5] to identify regions in the
application's Data Flow Graphs (DFGs) that qualify for CLUs. None of the current
approaches attempts to identify the regions through an automated low complexity approach
that deals with Control Data Flow Graphs (CDFGs).

Secondly, the number of levels in a hierarchical network architecture must be
influenced by the number of processing elements or CLUs needed to complete the task /
application. This in turn depends on the amount of parallelism that can be extracted from the
algorithm and the percentage of resource utilization. To the best of our knowledge no
research group in the area of reconfigurable computing has dealt with this problem.

Thirdly, the complex network on the chip makes dynamic scheduling expensive, as it
adds to the primary burden of power dissipation through routing resource utilization.
Therefore there is a need for a reconfiguration aware scheduling strategy. Most research
groups have adopted dynamic scheduling for a reconfigurable accelerator unit through a
scheduler that resides on a host processor [6,7].

The increasing demand for fast processing, high flexibility and reduced power
consumption naturally calls for the design and development of a dynamically
reconfigurable processor that is aware of, and minimizes, configuration time.

It is an object, therefore, to provide a low area, low power consuming and fast
reconfigurable processor.

Task scheduling [1] is an essential part of the design cycle of hardware
implementation for a given application. By definition, scheduling refers to the ordering of
sub-tasks belonging to an application and the allocation of resources to these tasks. The two
types of scheduling techniques are static and dynamic scheduling. Any application can be
modeled as a Control-Data Flow Graph. Most current applications offer a large
number of variations to users and hence are control-dominated. To arrive at an optimal static
schedule for such an application would involve a highly complex scheduling algorithm.
Branch and Bound is an example of such an algorithm with exponential complexity. Several
researchers have addressed task scheduling and one group has also addressed scheduling for
conditional tasks.

Any given application can be modeled as a CDFG G(V,E). V is the set of all nodes of
the graph. These nodes represent the various tasks of the CDFG. E is the set of all
communication edges. These edges can be either conditional or unconditional. There are two
possible methods of scheduling this CDFG, which are described below.
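
The CDFG G(V,E) described above can be sketched as a small data structure, shown here only as an illustration. The class name, the method names, and the encoding of a conditional edge as a branch outcome label are assumptions made for this example and are not taken from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class CDFG:
    """Hypothetical container for a control data flow graph G(V, E)."""
    nodes: set = field(default_factory=set)    # V: the tasks
    edges: list = field(default_factory=list)  # E: (src, dst, condition)

    def add_edge(self, src, dst, condition=None):
        # condition=None marks an unconditional edge; any other value is the
        # branch outcome at `src` under which the edge is taken.
        self.nodes.update((src, dst))
        self.edges.append((src, dst, condition))

    def successors(self, node, branch_outcomes):
        """Return the tasks that follow `node` for the given branch outcomes."""
        result = []
        for src, dst, cond in self.edges:
            if src != node:
                continue
            if cond is None or branch_outcomes.get(src) == cond:
                result.append(dst)
        return result


g = CDFG()
g.add_edge("read", "test")              # unconditional
g.add_edge("test", "filter", "true")    # conditional on the branch at "test"
g.add_edge("test", "copy", "false")
g.add_edge("filter", "write")
g.add_edge("copy", "write")
print(g.successors("test", {"test": "true"}))
```

A static scheduler has to plan for every branch outcome in such a graph, whereas a dynamic scheduler only follows the edges that are actually taken.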

Static scheduling of tasks is done at compile time. It is assumed that the lifetimes of all
the nodes are known at compile time. The final schedule is stored on-chip. During run-time,
if there is a mistake in the assumed lifetime of any node, then the schedule information
needs to be updated. The advantage of this method is that the worst-case execution time is
guaranteed. But a static schedule is always worse than a dynamic schedule in terms of
optimality. Some of the existing solutions for static scheduling are stated here.

Chekuri [2] discusses the earliest branch node retirement scheme. This is applicable
for trees and s-graphs. An s-graph is a graph where only one path has weighted nodes. In this
case, it is a collection of Directed Acyclic Graphs (DAGs) representing basic blocks which
all end in branch nodes, and the options at the branch nodes are: exit from the whole graph or
exit to another branch node. The problem with this approach is that it is applicable only to
small graphs and is also restricted to s-graphs and trees. It also does not consider nodes mapped
to specific processing elements.

Pop [3] tackles control task scheduling in two ways. The first is partial critical path
based scheduling. However, Pop does not assume that the value of the conditional controller is
known prior to the evaluation of the branch operation. Pop also proposes the use of a branch
and bound technique for finding a schedule for every possible branch outcome. This is quite
exhaustive, but it provides an optimal schedule. Once all possible schedules have been
obtained, the schedules are merged. The advantage is that the result is optimal, but it has the
drawback of being quite complex. It also does not consider loop structures.

In dynamic scheduling, tasks are scheduled during run-time. The main advantage of such an
approach is that there is no need for a schedule to be stored on-chip. Moreover, the schedule
obtained is optimal. But a major limiting factor is that the schedule information needs to be
communicated to all the processing elements on the chip at all times. This is a degrading
factor in an architecture where interconnects occupy 70% of total area.

Jha [4] addresses scheduling of loops with conditional paths inside them. This is a
good approach as it exploits parallelism to a large extent and uses loop unrolling. But the
drawback is that the control mechanism for having knowledge of each iteration and the
resource handling that iteration is very complicated. This is useful for one or two levels of
loop unrolling. It is quite useful where the processing units can afford to communicate quite
often with each other and the scheduler. But in our case, the network occupies about 70% of
the chip area [6] and hence the processing elements cannot afford to communicate with each
other too often. Moreover, the granularity level of operation between processing elements is
beyond a basic block level and hence this method is not practical.

Mooney [5] discusses a path based edge activation scheme. This means that if for a
group of nodes (which must be scheduled onto the same processing unit and whose schedules
are affected by branch paths occurring at a later stage) one knows ahead of time the branch
controlling values, then one can at run time prepare all possible optimized list schedules for
every possible set of branch controller values. This method is very similar to the partial
critical path based method proposed by Pop discussed above. It involves the use of a
hardware scheduler, which is an overhead.

Existing research work on scheduling applications for reconfigurable devices has been
focused on context-scheduling. A context is the bit-level information that is used to configure
any particular circuit to do a given task. A brief survey of research done in this area is given
here.

Noguera [7] proposes a dynamic scheduler and four possible scheduling algorithms to
schedule contexts. These contexts are used to configure the Dynamic Reconfiguration Logic
(DRL) blocks. This is well-suited for applications which have non-deterministic execution
times.

Schmidt [8] aims to dynamically schedule tasks for FPGAs. Initially, all the tasks are
allocated as they come until the entire real estate is used up. Schmidt proposes methods to
reduce the waiting time of the tasks arriving next. A proper rearrangement of tasks currently
executing on the FPGA is done in order to place the new task. A major limitation of this
method is that it requires knowing the target architecture while designing the rearrangement
techniques.

Fernandez [9] discusses a scheduling strategy that aims to allocate tasks belonging to
a DFG to the proposed MorphoSys architecture. All the tasks are initially scheduled using a
heuristic-based method which minimizes the total execution time of the DFG. Context
loading and data transfers are scheduled on top of the initial schedule. Fernandez tries to hide
context loading and data transfers behind the computation time of kernels. A main drawback
is that this method does not apply to CDFG scheduling.

Bhatia [10] proposes a methodology to do temporal partitioning of a DFG and then
schedule the various partitions. The scheduler makes sure that the data dependence between
the various partitions is maintained. This method is not suited for our purpose, which needs
real-time performance.

Memik [11] describes a super-scheduler to schedule DFGs for reconfigurable
architectures. He initially allocates the resources to the most critical path of the DFG. Then
the second most critical path is scheduled and so on. Scheduling of paths is done using
Non-crossing Bipartite matching. Though the complexity of this algorithm is low, the schedule is
nowhere near optimal.

Jack Liu [12] proposes a Variable Instruction Set Computer (VISC) architecture.
Scheduling is done at the basic block level. An optimal schedule to order the instructions
within a basic block has been proposed. This order of instructions is used to determine the
hardware clusters.

An analysis of the existing work on scheduling techniques for reconfigurable
architectures has shown that there is not enough work done on static scheduling techniques
for CDFGs. This shows the need for a novel method to do the same.

The VLSI chip design cycle includes the steps of system specification,
functional design, logic design, circuit design, physical design, fabrication and
packaging. The physical design automation of FPGAs involves three steps, which include
partitioning, placement and routing.

Despite advances in VLSI design automation, the time to market for a chip is
unacceptable for many applications. The key problem is the time taken to fabricate
chips, and therefore there is a need to find new technologies which minimize the fabrication
time. Gate Arrays use less time in fabrication as compared to full custom chips, since only
routing layers are fabricated on top of a pre-fabricated wafer. However, fabrication time for gate
arrays is still unacceptable for several applications. In order to reduce the time to fabricate
interconnects, programmable devices have been introduced which allow users to program the
devices as well as the interconnect.

The FPGA is a new approach to ASIC design that can dramatically reduce manufacturing
turnaround time and cost. In its simplest form an FPGA consists of a regular array of
programmable logic blocks interconnected by a programmable routing network. A
programmable logic block is a RAM and can be programmed by the user to act as a small
logic module. The key advantage of the FPGA is re-programmability.

The VLSI chip design cycle includes the steps of system specification, functional
design, logic design, circuit design, physical design, fabrication and packaging. Physical
design includes partitioning, floor planning, placement, routing and compaction.

The physical design automation of FPGAs involves three steps, which include
partitioning, placement, and routing. Partitioning in FPGAs is significantly different from
partitioning in other design styles. This problem depends on the architecture in which the
circuit has to be implemented. Placement in FPGAs is very similar to gate array
placement. Routing in FPGAs is to find a connection path and program the appropriate
interconnection points. In this step the circuit representation of each component is converted
into a geometric representation. This representation is a set of geometric patterns, which
perform the intended logic function of the corresponding component. Connections between
different components are also expressed as geometric patterns. Physical design is a very
complex process and therefore it is usually broken into various subsets.

The input to the physical design cycle is the circuit diagram and the output is the
layout of the circuit. This is accomplished in several stages such as partitioning, floor
planning, placement, routing and compaction.

A chip may contain several transistors. Layout of the entire circuit cannot be handled
at once due to the limitation of memory space as well as computation power available. Therefore
it is normally partitioned by grouping the components into blocks. The actual partitioning process
considers many factors such as the size of the blocks, number of blocks, and the number of
interconnections between the blocks. The set of interconnections required is referred to as a net
list. In large circuits the partitioning process is hierarchical and at the topmost level a chip
may have 5 to 25 blocks. Each block is then partitioned recursively into smaller blocks.

Floor planning is concerned with selecting good layout alternatives for each block as well
as the entire chip. The area of each block can be estimated after partitioning and is based
approximately on the number and type of components in that block. In addition, the interconnect
area required within the block must also be considered. Very often the task of floor plan
layout is done by a design engineer rather than a CAD tool, because a human is
better at visualizing the entire floor plan and taking into account the information flow. In
addition, certain components are often required to be located at specific positions on the chip.

During placement the blocks are exactly positioned on the chip. The goal of placement is to
find a minimum area arrangement for the blocks that allows completion of interconnections
between the blocks while meeting the performance constraints. Placement is usually done in
two phases. In the first phase initial placement is done. In the second phase the initial
placement is evaluated and iterative improvements are made until the layout has minimum area
or best performance.

The quality of placement will not be clear until the routing phase has been completed.
Placement may lead to an un-routable design. In that case another iteration of placement is
necessary. To limit the number of iterations of the placement algorithm, an estimate of the
required routing space is used during the placement process. Good routing and circuit
performance depend heavily on a good placement algorithm. This is due to the fact that once
the position of a block is fixed, there is not much that can be done to improve the routing and the
circuit performance.

The objective of routing is to complete the interconnection between the blocks
according to the specified net list. First the space that is not occupied by the blocks (routing
space) is partitioned into rectangular regions called channels and switchboxes. This includes
the space between the blocks. The goal of the router is to complete all circuit connections
using the shortest possible wire length and using only the channels and switchboxes. This is
usually done in two phases, referred to as the global routing and detailed routing phases. In global
routing, connections are completed between the proper blocks disregarding the exact
geometric details of each wire. For each wire the global router finds a list of channels and
switchboxes to be used as a passageway for that wire. Detailed routing, which completes
point-to-point connections, follows global routing. Global routing is converted into exact routing by
specifying geometric information such as the location and spacing of wires. Routing is a very
well-studied problem. Since almost all routing problems are computationally hard,
researchers have focused on heuristic algorithms.

Compaction is the task of compressing the layout in all directions such that the total
area is reduced. Making the chip smaller reduces wire lengths, which in turn reduces
the signal delay.

Generally approaches to global routing are classified as sequential and concurrent 
approaches. 

In the sequential approach, nets are routed one by one. If a net is routed it may block other nets
which are yet to be routed. As a result this approach is very sensitive to the order in which the nets
are considered for routing. Usually the nets are ordered with respect to their criticality. The
criticality of a net is determined by the importance of the net. For example, a clock net may
determine the performance of the circuit, so it is considered highly critical. However,
sequencing techniques do not solve the net ordering problem satisfactorily. An improvement
phase is used to remove blockages when further routing is not feasible. This may also not
solve the net ordering problem, so in addition the 'rip-up and reroute' technique [Bol79,
DK82] and 'shove-aside' techniques are used. In rip-up and reroute, the interfering wires are
ripped up and rerouted to allow routing of the affected nets, whereas in the shove-aside technique
wires that block completion of failed connections are moved aside without breaking the
existing connection. Another approach [De86] is to first route simple nets consisting of only
two or three terminals, since there are few choices for routing such nets. After the simple nets
are routed, a Steiner Tree algorithm is used to route intermediate nets. Finally a maze routing
algorithm is used to route the remaining multi-terminal nets, which are not too numerous.
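
To make the sequential approach concrete, the following sketch routes two-terminal nets one at a time on a small grid using a breadth-first maze search, marking each routed wire as an obstacle for later nets. The grid encoding ('#' for blocked, '.' for free), the function names, and the restriction to two-terminal nets are assumptions made for this illustration; the cited routers are not reproduced here.

```python
from collections import deque


def maze_route(grid, source, target):
    """Breadth-first (Lee-style) search for a shortest path between two cells.
    grid[r][c] == '#' marks a blocked cell. Returns a list of (row, col) or None."""
    rows, cols = len(grid), len(grid[0])
    prev = {source: None}
    frontier = deque([source])
    while frontier:
        cell = frontier.popleft()
        if cell == target:
            path = []
            while cell is not None:      # walk back to the source
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in prev):
                prev[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None


def route_sequentially(grid, nets):
    """Route nets in the given order; each routed wire blocks later nets."""
    cells = [list(row) for row in grid]
    routed = {}
    for name, source, target in nets:
        path = maze_route(cells, source, target)
        routed[name] = path
        if path is not None:
            for r, c in path:
                cells[r][c] = "#"
    return routed


result = route_sequentially(
    ["...", "...", "..."],
    [("a", (1, 0), (1, 2)), ("b", (0, 1), (2, 1))],
)
print(result)
```

In this example net "a" is routed straight across the middle row and net "b" then has no free path, which is the kind of blockage that rip-up and reroute or shove-aside techniques are meant to resolve.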

To match the needs of the future moderately complex applications, provided is the
first of a series of tools intended to help in the design and development of a dynamically
reconfigurable multimedia processor.

Brief Summary 

In accordance with this invention, designing processing elements based on identifying
correlated compute intensive regions within each application and between applications results
in large amounts of processing in localized regions of the chip. This reduces the number of
reconfigurations and hence yields faster application switching. This also reduces the amount of
on-chip communication, which in turn helps reduce power consumption. Since applications can
be represented as Control Data Flow Graphs (CDFGs), such a pre-processing analysis lies in
the area of pattern matching, specifically graph matching. In this context a reduced
complexity, yet exhaustive enough, graph matching algorithm is provided. The amount of
on-chip communication is reduced by adopting reconfiguration aware static scheduling to
manage task and resource dependencies on the processor. This is complemented by a divide
and conquer approach which helps in the allocation of an appropriate number of processing
units aimed towards achieving uniform resource utilization.

In accordance with one aspect of the present invention a control data flow graph is
produced from source code for an application having complexity approximating that of
MPEG-4 multimedia applications. From the control data flow graph are extracted basic
blocks of code represented by the paths between branch points of the graph. Intermediate
data flow graphs then are developed that represent the basic blocks of code. Clusters of
operations common to the intermediate data flow graphs are identified. The largest common
subgraph is determined from among the clusters for implementation in hardware.

Efficiency is enhanced by ASAP scheduling of the largest common subgraph. The
ASAP scheduled largest common subgraph then is applied to the intermediate flow graphs to
which the largest common subgraph is common. The intermediate flow graphs then are
scheduled for reduction of time of operation. This scheduling produces data patches
representing the operations and timing of the scheduled intermediate flow graphs having the
ASAP scheduled largest common subgraph therein. The data patches are then combined to
include the operations and timing of the largest common subgraph and the operations and
timing of each of the intermediate flow graphs that contain the largest common subgraph.
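
ASAP (as soon as possible) scheduling, as commonly defined, starts each operation at the earliest cycle at which all of its producers have finished. The sketch below illustrates that general technique on a small hypothetical subgraph; the operation names, latencies, and function name are assumptions for this example and do not represent the scheduler of the invention.

```python
from collections import deque


def asap_schedule(ops, deps):
    """ops: dict mapping operation -> latency in cycles.
    deps: list of (producer, consumer) pairs.
    Returns a dict mapping operation -> earliest start cycle."""
    preds_left = {op: 0 for op in ops}
    succs = {op: [] for op in ops}
    for producer, consumer in deps:
        succs[producer].append(consumer)
        preds_left[consumer] += 1

    start = {op: 0 for op in ops}
    ready = deque(op for op, n in preds_left.items() if n == 0)
    scheduled = 0
    while ready:
        op = ready.popleft()
        scheduled += 1
        finish = start[op] + ops[op]
        for nxt in succs[op]:
            # A consumer cannot start before its latest producer finishes.
            start[nxt] = max(start[nxt], finish)
            preds_left[nxt] -= 1
            if preds_left[nxt] == 0:
                ready.append(nxt)
    if scheduled != len(ops):
        raise ValueError("dependency graph contains a cycle")
    return start


# Hypothetical common subgraph: two multiplies feeding an add, then a shift.
ops = {"mul1": 2, "mul2": 2, "add": 1, "shift": 1}
deps = [("mul1", "add"), ("mul2", "add"), ("add", "shift")]
schedule = asap_schedule(ops, deps)
print(schedule)
```

Both multiplies start at cycle 0, the add at cycle 2, and the shift at cycle 3, so the subgraph completes in 4 cycles.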

At this point, it will be appreciated, the utilization of the hardware that represents the
ASAP-scheduled largest common subgraph by the operations of each implicated intermediate
flow graph needs scheduling. Bearing in mind the duration of use of the hardware representing
the largest common subgraph by the operations of each of the implicated intermediate flow
graphs, hardware usage is scheduled for fastest completion of the combined software and
hardware operations of all affected intermediate flow graphs as represented in the combined
data patches. The method of scheduling according to the present invention treats reconfiguration
edges in the same way as communication edges and includes the reconfiguration overhead
while determining critical paths. This enables employment of the best CDFG scheduling
technique and incorporation of the reconfiguration edges.

Our target architecture is a reconfigurable architecture. This adds a new dimension to
the CDFG discussed above. A new type of edge is possible between any two nodes of the
CDFG that will be implemented on the same processor. Let us call this a "reconfiguration
edge." A reconfiguration time can be associated with this edge. This information must be
accounted for while scheduling this modified CDFG.
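The treatment of reconfiguration edges as delay-carrying edges can be sketched as a longest-path computation. This is a minimal sketch, not the scheduler of the invention: the function name `critical_path_length`, the node names and all timing values are hypothetical, and each edge is assumed to carry a single delay that is either a communication time or a reconfiguration time.

```python
from collections import defaultdict


def critical_path_length(node_time, edges):
    """Length of the longest (critical) path through a DAG.

    node_time: {node: execution time}
    edges: list of (src, dst, delay); the delay is a communication time or,
           for a reconfiguration edge, a reconfiguration time.
    """
    successors = defaultdict(list)
    indegree = {n: 0 for n in node_time}
    for src, dst, delay in edges:
        successors[src].append((dst, delay))
        indegree[dst] += 1

    # Earliest finish time of each node, refined as predecessors complete.
    finish = dict(node_time)
    ready = [n for n in node_time if indegree[n] == 0]
    while ready:
        n = ready.pop()
        for dst, delay in successors[n]:
            finish[dst] = max(finish[dst], finish[n] + delay + node_time[dst])
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return max(finish.values())


# Hypothetical three-node graph: B and C share a processor, so the edge
# B -> C is a reconfiguration edge with an assumed overhead of 4 time units.
times = {"A": 2, "B": 3, "C": 1}
with_reconfig = [("A", "B", 1), ("A", "C", 0), ("B", "C", 4)]
without_reconfig = [("A", "B", 1), ("A", "C", 0), ("B", "C", 0)]
print(critical_path_length(times, with_reconfig))     # 11
print(critical_path_length(times, without_reconfig))  # 7
```

Ignoring the reconfiguration overhead would understate the critical path in this example by the full reconfiguration time, which is why the overhead is counted while determining critical paths.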

To realize the largest common flow graph in hardware, processor component layout
and interconnections by a connective fabric need to be addressed.

In accordance with the invention, a tool set is provided that aids the design of a dynamically
reconfigurable processor through the use of a set of analysis and design tools. A
part of the tool set is a heterogeneous hierarchical routing architecture. Compared to
hierarchical and symmetrical FPGA approaches, building blocks are of variable size. This
results in heterogeneity between groups of building blocks at the same hierarchy level, as
opposed to the classical H-FPGA approach. Also in accordance with this invention, a
methodology for the design and implementation of the proposed architecture, which involves
packing, hierarchy formation, placement, and network scheduler tools, is provided.

The steps of component layout and interconnectivity involve (1) partitioning - cluster
recognition and extraction, (2) placement - the location of components in the available area
on a chip, and (3) routing - the interconnection of components via conductors and switches
with the goal of maximum speed and minimum power consumption.

Detailed Description

Turning to Fig. 1, source code in C or C++ for an MPEG4 multimedia application that
includes a pair of operations, "Affine Transform" and "Perspective," is input to a Lance
compiler utility 101 running its "Show CFG" operation. This outputs Control Flow Graphs
(CFGs). Control Flow Graphs for the Affine Transform and Perspective are shown in Fig. 2.
As seen in the Affine CFG of Fig. 2, the Affine Transform Control Flow Graph is composed
of a series of basic blocks 106, 108, 110, 112 and 114. The CFG of the multimedia
component Perspective is similarly composed of basic blocks. CFGs output by the Lance
compiler utility 101 are actually more textual than their depictions in Fig. 2, but are readily
understood to describe basic blocks and their interconnections. The Affine Transform has a
number of its blocks 108, 110, 112 arranged in loops, whereas block 106 is a preloop listing.

Visually, at present, the many CFGs of the multimedia application are inspected for
similarity among large control blocks. How big the candidate blocks should be is a
judgment call. Similar blocks of more than 50 lines in two or more CFGs are good
candidates for development of a Largest Common Flow Graph among them whose operations
are to be shared as described below. Smaller basic blocks can similarly be subjected to the
development of largest common flow graphs as described below, but at some point the
exercise returns insignificant time and cost savings. The Affine Transform preloop basic
block has 70 instructions. The Perspective preloop basic block 118 has 85 instructions.
Those instructions are as follows:

Affine preloop basic block 106

t541 = s_178 / 2;
t348 = 2 * i0_166;
t349 = t348 + du0_172;
t350 = t541 * t349;
t352 = 2 * j0_167;
t353 = t352 + dv0_173;
t354 = t541 * t353;
t356 = 2 * i1_168;
t357 = t356 + du1_174;
t358 = t357 + du0_172;
t359 = t541 * t358;
t361 = 2 * j1_169;
t362 = t361 + dv1_175;
t363 = t362 + dv0_173;
t364 = t541 * t363;
t366 = 2 * i2_170;
t367 = t366 + du2_176;
t368 = t367 + du0_172;
t369 = t541 * t368;
t371 = 2 * j2_171;
t372 = t371 + dv2_177;
t373 = t372 + dv0_173;
t374 = t541 * t373;
t542 = 256;
t375 = i0_166 + t542;
t376 = 16 * t375;
t543 = r_179 * t359;
t544 = 16 * i1_168;
t21 = t543 - t544;
t381 = -80 * t21;
t385 = t542 * t21;
t386 = t381 + t385;
t545 = 176;
t387 = t386 / t545;
t388 = t376 + t387;
t546 = 16 * j0_167;
t547 = r_179 * t354;
t22 = t547 - t546;
t394 = -80 * t22;
t395 = r_179 * t364;
t396 = 16 * j1_169;
t397 = t395 - t396;
t398 = t542 * t397;
t399 = t394 + t398;
t400 = t399 / t545;
t401 = t546 + t400;
t548 = 16 * i0_166;
t404 = r_179 * t350;
t406 = t404 - t548;
t407 = -112 * t406;
t408 = r_179 * t369;
t409 = 16 * i2_170;
t410 = t408 - t409;
t411 = t542 * t410;
t412 = t407 + t411;
t549 = 144;
t413 = t412 / t549;
t414 = t548 + t413;
t415 = j0_167 + t542;
t416 = 16 * t415;
t421 = -112 * t22;
t422 = r_179 * t374;
t423 = 16 * j2_171;
t424 = t422 - t423;
t425 = t542 * t424;
t426 = t421 + t425;
t427 = t426 / t549;
t428 = t416 + t427;
i_185 = 0;

Perspective preloop basic block 118 

t744 = s_221 / 2;
t542 = 2 * i0_205;
t543 = t542 + du0_213;
t544 = t744 * t543;
t546 = 2 * j0_206;
t547 = t546 + dv0_214;
t548 = t744 * t547;
t550 = 2 * i1_207;
t551 = t550 + du1_215;
t552 = t551 + du0_213;
t553 = t744 * t552;
t555 = 2 * j1_208;
t556 = t555 + dv1_216;
t557 = t556 + dv0_214;
t558 = t744 * t557;
t560 = 2 * i2_209;
t561 = t560 + du2_217;
t562 = t561 + du0_213;
t563 = t744 * t562;
t565 = 2 * j2_210;
t566 = t565 + dv2_218;
t567 = t566 + dv0_214;
t568 = t744 * t567;
t570 = 2 * i3_211;
t571 = t570 + du3_219;
t572 = t571 + du2_217;
t573 = t572 + du1_215;
t574 = t573 - du0_213;
t575 = t744 * t574;
t577 = 2 * j3_212;
t578 = t577 + dv3_220;
t579 = t578 + dv2_218;
t580 = t579 + dv1_216;
t581 = t580 + dv0_214;
t582 = t744 * t581;
t745 = t544 - t553;
t28 = t745 - t563;
t34 = t28 + t575;
t746 = t568 - t582;
t587 = t34 * t746;
t747 = t563 - t575;
t748 = t548 - t558;
t29 = t748 - t568;
t35 = t29 + t582;
t592 = t747 * t35;
t593 = t587 - t592;
t749 = 144;
t594 = t593 * t749;
t750 = t553 - t575;
t599 = t35 * t750;
t751 = t558 - t582;
t604 = t751 * t34;
t605 = t599 - t604;
t752 = 176;
t606 = t605 * t752;
t609 = t750 * t746;
t612 = t747 * t751;
t613 = t609 - t612;
t614 = t553 - t544;
t615 = t613 * t614;
t616 = t615 * t749;
t617 = t594 * t553;
t618 = t616 + t617;
t619 = t563 - t544;
t620 = t613 * t619;
t621 = t620 * t752;
t622 = t606 * t563;
t623 = t621 + t622;
t624 = t613 * t544;
t625 = t624 * t752;
t626 = t625 * t749;
t627 = t558 - t548;
t628 = t613 * t627;
t629 = t628 * t749;
t630 = t594 * t558;
t631 = t629 + t630;
t632 = t568 - t548;
t633 = t613 * t632;
t634 = t633 * t752;
t635 = t606 * t568;
t636 = t634 + t635;
t637 = t613 * t548;
t638 = t637 * t752;
t639 = t638 * t749;
i_228 = 0;

At 120 in Fig. 1 the basic blocks are extracted from the CFGs 103 and 104 (Fig. 2)
developed by the Lance utility 101. The exemplary Affine and Perspective basic blocks are
shown in Fig. 1 being input to the Lance compiler utility running its "Show DFG" operation
to develop an Affine data flow graph and a Perspective data flow graph at outputs 122 and
123. The extraction of the basic blocks at 120 in Fig. 1 may be effected manually or by a
simple program that discards low instruction count basic blocks prior to passing them along to
the Lance compiler 101 for the production of the data flow graphs. The data flow graphs out
of the Lance compiler are input to an operation by which pairs of data flow graphs are
selected as candidates for development of a largest common subgraph.
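The "simple program" mentioned above for discarding low instruction count basic blocks might look like the following. This is a minimal sketch under stated assumptions: the function name `select_candidate_blocks`, the dictionary layout and the block named `small_block` are hypothetical, and the threshold of 50 is taken from the guideline given earlier for blocks of more than 50 lines.

```python
def select_candidate_blocks(blocks, min_instructions=50):
    """Keep only basic blocks with more than min_instructions instructions.

    blocks: {block name: list of instruction strings}
    """
    return {
        name: instructions
        for name, instructions in blocks.items()
        if len(instructions) > min_instructions
    }


# Hypothetical input: instruction lists are stand-ins of the stated lengths.
blocks = {
    "affine_preloop_106": ["instr"] * 70,
    "perspective_preloop_118": ["instr"] * 85,
    "small_block": ["instr"] * 12,
}
print(sorted(select_candidate_blocks(blocks)))
```

Only the two preloop blocks pass the filter here; the small block is discarded before data flow graphs are produced.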

Remembering that many data flow graphs may have been produced from the
multimedia application initially input to the Lance compiler utility 101, it is at this point that
a selection process identifies the Affine and Perspective as good candidates for pairing to
develop the desired largest common subgraph. That selection process is indicated at 124 in
Fig. 1. Data flow graphs of the kind selected are shown in Figs. 4 (a) and (b). These are
directed acyclic graphs (DAGs). This is to say, as indicated by the arrows in Figs. 4 (a) and
(b), the operations move in a single direction from top to bottom and do not loop back. The
rectangles of Fig. 4 (a) represent the instructions of the Affine preloop basic block 106 and
the rectangles of Fig. 4 (b) represent the instructions of the Perspective preloop basic block
118.

Again visually, as currently implemented, these data flow graphs are compared for
similarity and two or more are chosen. Again, a simple program may be implemented for the
same purpose, as will be apparent. For individual comparison, like elements of the data flow
graphs are identically colored. The instructions contained in the individual rectangles of the
data flow graphs of Figs. 4 (a) and 4 (b) are add (+), divide (/), multiply (*), subtract (-) and
memory transaction (not shown). To make it visually easier to identify similarities, then, in
the present, visual implementation, each type of instruction is color-coded blue, red, green,
etc. In the example of Fig. 1, the data flow graphs for the Affine and Perspective preloop
basic blocks have been chosen and are
input at 126 and 127 to a routine 129 to identify the Largest Common Subgraph (LCSG)
shared by the two data flow graphs. One approach to development of the LCSG is discussed
below under "Proposed Approach."

Description of LCSG Development 

Fig. 5 illustrates the largest common subgraph developed from the Affine and
Perspective preloop basic blocks. At 131 and 133, ASAP scheduling of the LCSG takes
place in known fashion, iteratively, with the LCSG individually and with the LCSG inserted
into the Data Flow Graphs, until the most efficient scheduling of the Data Flow Graphs is
realized at block 133.

ASAP scheduling is a known technique. In the LCSG of Fig. 5 it is accomplished by
moving elements representing instructions upward where possible to permit their use more
quickly and perhaps more quickly freeing a circuit component that effects that instruction for
a further use. From the LCSG of Fig. 5 it will be seen that 33 instructions from each of the
Affine and Perspective codes have now been identified to be implemented in hardware and
shared by the two multimedia operations represented by the Affine and Perspective CFGs
originally developed at 101. The same will be done for other Control Flow Graphs
representing other portions of the multimedia application introduced at the compiler 101.
Instructions not covered by an LCSG will be accomplished by general purpose processing
LUTs on the ultimate chip. The output from the ASAP scheduling that occurs at 131 is an
intermediate result or graph. Affine and Perspective DAGs with ASAP scheduling and the
inclusion of the common LCSG are shown in Figs. 6 (a) and 6 (b). In Fig. 6 (a), for example,
it will be seen that the instruction A1 has been moved up from line 2 in Fig. 5's unscheduled
LCSG to the same line (line 1) as the instruction V. Likewise the instruction A3 has been
moved up so that there are now four like instructions in the first line of the LCSG portion of
the Fig. 6 (a) Affine DAG requiring four processing elements. In the second line instructions
A2 and A4 have been moved up and are now at the same line as instruction U and
instruction X. These are all like instructions, so four like processing elements will be
required to simultaneously run the four instructions. However, in Fig. 5, the LCSG
originally included ten circuit elements of a kind in a single line beginning with the element
designated e, whereas now the largest number of such elements in a line of the LCSG in Fig.
6 (a) is only six. The resistors R1, R2... in Figs. 6 (a) and 6 (b) are inserted delays between
executions of instructions.
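ASAP scheduling as described above, in which each instruction is moved up to the earliest line its operands allow, can be sketched as follows. This is a minimal illustration, not the scheduler used at blocks 131 and 133: the function name `asap_schedule` is hypothetical, every instruction is assumed to take one line (time step), and the example dependencies are the first seven instructions of the Affine preloop basic block, listing only dependencies on other instructions and not on block inputs.

```python
from collections import Counter


def asap_schedule(deps):
    """Assign each operation the earliest line (1-based) its operands allow.

    deps: {operation: list of operations whose results it consumes}
    """
    level = {}

    def visit(op):
        if op not in level:
            # An operation runs one line after its latest-finishing operand.
            level[op] = 1 + max((visit(p) for p in deps.get(op, [])), default=0)
        return level[op]

    for op in deps:
        visit(op)
    return level


# Dependencies among the first Affine preloop instructions.
deps = {
    "t541": [],                # s_178 / 2
    "t348": [],                # 2 * i0_166
    "t349": ["t348"],          # t348 + du0_172
    "t350": ["t541", "t349"],  # t541 * t349
    "t352": [],                # 2 * j0_167
    "t353": ["t352"],          # t352 + dv0_173
    "t354": ["t541", "t353"],  # t541 * t353
}
levels = asap_schedule(deps)
per_line = Counter(levels.values())
print(levels)
print(per_line)  # number of operations that would run simultaneously per line
```

The per-line counts give a rough indication of how many processing elements a line would need if all its operations were of a like kind; in the source, that count is tallied per instruction type.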
Output from the block 133 are the scheduled Affine and Perspective graphs of Figs. 6
(a) and 6 (b). At blocks 135 and 136 data paths are defined for each of these and at block 138
data paths are combined to produce the code for the circuit Z in VHDL. That code for the
combined preloop basic blocks of Affine and Perspective follows.

preloop_common.vhd

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
use ieee.numeric_std.all;

entity preloop_common_datapath is
port(
-- inputs
ip_1, ip_2, ip_3, ip_4, ip_5, ip_6, ip_7, ip_8, ip_9, ip_10, ip_11 : in std_logic_vector(15 downto 0);

-- constant inputs
constant_1, constant_2, constant_3, constant_4, constant_5, constant_6, constant_7,
constant_8, constant_9, constant_10, constant_11, constant_12, constant_13, constant_14,
constant_15, constant_16,
constant_17, constant_18, constant_19, constant_20, constant_21, constant_22 : in
std_logic_vector(15 downto 0);

-- 2 input mux select lines
sel_1, sel_2, sel_11, sel_12, sel_21, sel_22, sel_23, sel_24, sel_25, sel_26,
sel_27, sel_28, sel_29, sel_30 : in std_logic;

-- 3 input mux select lines
sel_3, sel_4, sel_5, sel_6, sel_7, sel_8, sel_9, sel_10, sel_13, sel_14, sel_15,
sel_16, sel_17, sel_18, sel_19, sel_20 : in std_logic_vector(1 downto 0);

-- enable signals for tri-state buffers at output of muxs
en_1, en_2, en_3, en_4, en_5, en_6, en_7, en_8, en_9, en_10, en_11, en_12, en_13, en_14,
en_15,
en_16, en_17, en_18, en_19, en_20, en_21, en_22, en_23, en_24, en_25, en_26, en_27,
en_28,
en_29, en_30 : in std_logic;

-- output signals
op_1, op_2, op_3, op_4, op_5, op_6 : out std_logic_vector(15 downto 0);

clk : in std_logic;
rst : in std_logic
);
end preloop_common_datapath;

architecture arch_preloop_common_datapath of preloop_common_datapath is

component xcv2_mult16x16s is
port(
a : in std_logic_vector(15 downto 0);
b : in std_logic_vector(15 downto 0);
clk : in std_logic;
prod : out std_logic_vector(31 downto 0)
);
end component;

-- these muxs are those controlling inputs to adders and multipliers
signal mux_1out, mux_2out, mux_3out, mux_4out, mux_5out, mux_6out : std_logic_vector(15 downto 0);
signal mux_7out, mux_8out, mux_9out, mux_10out, mux_11out, mux_12out : std_logic_vector(15 downto 0);
signal mux_13out, mux_14out, mux_15out, mux_16out, mux_17out, mux_18out : std_logic_vector(15 downto 0);
signal mux_19out, mux_20out : std_logic_vector(15 downto 0);

-- these muxs are those controlling register delay paths that differentiate
-- affine and perspective transform configurations
signal mux_21out, mux_22out, mux_23out, mux_24out, mux_25out, mux_26out,
mux_27out, mux_28out, mux_29out, mux_30out : std_logic_vector(15 downto 0);

-- these signals capture the 32 bit outputs from multipliers and are
-- fed to filters that remove the 31-16 MSBs
signal temp_1, temp_2, temp_3, temp_4, temp_5, temp_6, temp_7, temp_8, temp_9,
temp_10 : std_logic_vector(31 downto 0);

-- these signals get the 16 bit outputs from the temp signals and feed to register inputs
signal input_reg_1, input_reg_12, input_reg_14, input_reg_19, input_reg_25, input_reg_28,
input_reg_39, input_reg_41, input_reg_6, input_reg_33, input_reg_20, input_reg_15,
input_reg_26, input_reg_29, input_reg_22 : std_logic_vector(15 downto 0);

-- these signals are the outputs of tri_state buffers present after the muxs
-- which control the exit points of the adjusted delayed paths
signal tri_state21, tri_state22, tri_state23, tri_state24, tri_state25, tri_state26, tri_state27,
tri_state28, tri_state29, tri_state30 : std_logic_vector(15 downto 0);

signal reg_1, reg_2, reg_3, reg_4, reg_5, reg_6, reg_7, reg_8, reg_9, reg_10,
reg_12, reg_14, reg_15, reg_19, reg_20,
reg_22, reg_23, reg_24, reg_25, reg_26, reg_28, reg_29, reg_33,
reg_34, reg_35, reg_36, reg_37, reg_39,
reg_41, reg_42, reg_43, reg_44, reg_45, reg_46, reg_47, reg_48, reg_49, reg_50,
reg_51, reg_52, reg_53, reg_54, reg_55, reg_56, reg_57, reg_58, reg_59, reg_60,
reg_61, reg_62, reg_63, reg_64, reg_65, reg_66, reg_67, reg_68, reg_69, reg_70,
reg_71, reg_72, reg_73, reg_74, reg_75, reg_76, reg_77, reg_78, reg_79, reg_80,
reg_81 : std_logic_vector(15 downto 0);

begin

-- the following are the multiplexers controlling the inputs to multipliers

mux_1out <= reg_20 when sel_1 = '0' else tri_state22;

mux_2out <= reg_24 when sel_2 = '0' else constant_2;

with sel_3 select mux_3out <=
ip_3 when "00",
reg_15 when "01",
tri_state23 when "10",
(others => 'Z') when others;

with sel_4 select mux_4out <=
constant_3 when "00",
reg_24 when "01",
constant_4 when "10",
(others => 'Z') when others;

with sel_5 select mux_5out <=
ip_4 when "00",
reg_20 when "01",
tri_state24 when "10",
(others => 'Z') when others;

with sel_6 select mux_6out <=
constant_5 when "00",
reg_23 when "01",
constant_6 when "10",
(others => 'Z') when others;

with sel_7 select mux_7out <=
ip_6 when "00",
reg_23 when "01",
tri_state25 when "10",
(others => 'Z') when others;

with sel_8 select mux_8out <=
constant_7 when "00",
reg_23 when "01",
constant_8 when "10",
(others => 'Z') when others;

with sel_9 select mux_9out <=
ip_7 when "00",
reg_24 when "01",
tri_state26 when "10",
(others => 'Z') when others;

with sel_10 select mux_10out <=
constant_9 when "00",
reg_29 when "01",
constant_10 when "10",
(others => 'Z') when others;

mux_11out <= reg_24 when sel_11 = '0' else tri_state27;

mux_12out <= reg_26 when sel_12 = '0' else constant_11;

-- the following are the multiplexers controlling the input to adders

with sel_13 select mux_13out <=
reg_19 when "00",
ip_10 when "01",
tri_state21 when "10",
(others => 'Z') when others;

with sel_14 select mux_14out <=
constant_15 when "00",
constant_16 when "01",
reg_?? when "10", -- register index illegible in the source
(others => 'Z') when others;

with sel_15 select mux_15out <=
reg_14 when "00",
reg_15 when "01",
tri_state29 when "10",
(others => 'Z') when others;

with sel_16 select mux_16out <=
constant_17 when "00",
constant_18 when "01",
reg_14 when "10",
(others => 'Z') when others;

with sel_17 select mux_17out <=
reg_25 when "00",
ip_11 when "01",
reg_?? when "10", -- register index illegible in the source
(others => 'Z') when others;

with sel_18 select mux_18out <=
constant_19 when "00",
constant_20 when "01",
tri_state28 when "10",
(others => 'Z') when others;

with sel_19 select mux_19out <=
reg_28 when "00",
reg_29 when "01",
reg_28 when "10",
(others => 'Z') when others;

with sel_20 select mux_20out <=
constant_21 when "00",
constant_22 when "01",
tri_state30 when "10",
(others => 'Z') when others;

-- the following are the statements implementing the multipliers

multp_inst1 : xcv2_mult16x16s
port map (ip_1, constant_1, clk, temp_1);
input_reg_1 <= temp_1(15 downto 0);

multp_inst2 : xcv2_mult16x16s
port map (mux_1out, mux_2out, clk, temp_2);
input_reg_12 <= temp_2(15 downto 0);

multp_inst3 : xcv2_mult16x16s
port map (mux_3out, mux_4out, clk, temp_3);
input_reg_14 <= temp_3(15 downto 0);

multp_inst4 : xcv2_mult16x16s
port map (mux_5out, mux_6out, clk, temp_4);
input_reg_19 <= temp_4(15 downto 0);

multp_inst5 : xcv2_mult16x16s
port map (mux_7out, mux_8out, clk, temp_5);
input_reg_25 <= temp_5(15 downto 0);

multp_inst6 : xcv2_mult16x16s
port map (mux_9out, mux_10out, clk, temp_6);
input_reg_28 <= temp_6(15 downto 0);

multp_inst7 : xcv2_mult16x16s
port map (mux_11out, mux_12out, clk, temp_7);
input_reg_39 <= temp_7(15 downto 0);

multp_inst8 : xcv2_mult16x16s
port map (ip_9, constant_12, clk, temp_8);
input_reg_41 <= temp_8(15 downto 0);

multp_inst9 : xcv2_mult16x16s
port map (ip_2, constant_13, clk, temp_9);
input_reg_6 <= temp_9(15 downto 0);

multp_inst10 : xcv2_mult16x16s
port map (ip_8, constant_14, clk, temp_10);
input_reg_33 <= temp_10(15 downto 0);

-- the following are the statements implementing the adders

input_reg_20 <= mux_13out + mux_14out;
input_reg_15 <= mux_15out + mux_16out;
input_reg_26 <= mux_17out + mux_18out;
input_reg_29 <= mux_19out + mux_20out;

-- the following are the statements implementing the divide / shifter

-- input_reg_22 <= ip_5 and "0011111111111111"; -- performing srl by 2
input_reg_22 <= "00" & ip_5(15 downto 2); -- SRL 3; -- performing srl by 2

-- the following are the statements implementing register transfers
-- sel line here being '1' represents state machine for Perspective Transform
-- enable line of the tristate buffers here is '1' when either Affine or Perspective state machine
-- selects the associated mux.

mux_21out <= reg_1 when sel_21 = '1' else reg_5;
tri_state21 <= mux_21out when en_21 = '1' else (others => 'Z');

mux_22out <= reg_12 when sel_22 = '1' else reg_51;
tri_state22 <= mux_22out when en_22 = '1' else (others => 'Z');

mux_23out <= reg_14 when sel_23 = '1' else reg_57;
tri_state23 <= mux_23out when en_23 = '1' else (others => 'Z');

mux_24out <= reg_19 when sel_24 = '1' else reg_63;
tri_state24 <= mux_24out when en_24 = '1' else (others => 'Z');

mux_25out <= reg_25 when sel_25 = '1' else reg_69;
tri_state25 <= mux_25out when en_25 = '1' else (others => 'Z');

mux_26out <= reg_28 when sel_26 = '1' else reg_75;
tri_state26 <= mux_26out when en_26 = '1' else (others => 'Z');

mux_27out <= reg_39 when sel_27 = '1' else reg_81;
tri_state27 <= mux_27out when en_27 = '1' else (others => 'Z');

mux_28out <= reg_41 when sel_28 = '0' else reg_45;
tri_state28 <= mux_28out when en_28 = '1' else (others => 'Z');

mux_29out <= reg_6 when sel_29 = '0' else reg_10;
tri_state29 <= mux_29out when en_29 = '1' else (others => 'Z');

mux_30out <= reg_33 when sel_30 = '0' else reg_37;
tri_state30 <= mux_30out when en_30 = '1' else (others => 'Z');
-- clocked process with asynchronous reset; only clk and rst are needed
-- in the sensitivity list
reg_pr : process (clk, rst)
begin
if (rst = '1') then
reg_1 <= (others => '0');
reg_2 <= (others => '0');
reg_3 <= (others => '0');
reg_4 <= (others => '0');
reg_5 <= (others => '0');
reg_6 <= (others => '0');
reg_7 <= (others => '0');
reg_8 <= (others => '0');
reg_9 <= (others => '0');
reg_10 <= (others => '0');
reg_12 <= (others => '0');
reg_14 <= (others => '0');
reg_15 <= (others => '0');
reg_19 <= (others => '0');
reg_20 <= (others => '0');
reg_22 <= (others => '0');
reg_23 <= (others => '0');
reg_24 <= (others => '0');
reg_25 <= (others => '0');
reg_26 <= (others => '0');
reg_28 <= (others => '0');
reg_29 <= (others => '0');
reg_33 <= (others => '0');
reg_34 <= (others => '0');
reg_35 <= (others => '0');
reg_36 <= (others => '0');
reg_37 <= (others => '0');
reg_39 <= (others => '0');
reg_41 <= (others => '0');
reg_42 <= (others => '0');
reg_43 <= (others => '0');
reg_44 <= (others => '0');
reg_45 <= (others => '0');
reg_46 <= (others => '0');
reg_47 <= (others => '0');
reg_48 <= (others => '0');
reg_49 <= (others => '0');
reg_50 <= (others => '0');
reg_51 <= (others => '0');
reg_52 <= (others => '0');
reg_53 <= (others => '0');
reg_54 <= (others => '0');
reg_55 <= (others => '0');
reg_56 <= (others => '0');
reg_57 <= (others => '0');
reg_58 <= (others => '0');
reg_59 <= (others => '0');
reg_60 <= (others => '0');
reg_61 <= (others => '0');
reg_62 <= (others => '0');
reg_63 <= (others => '0');
reg_64 <= (others => '0');
reg_65 <= (others => '0');
reg_66 <= (others => '0');
reg_67 <= (others => '0');
reg_68 <= (others => '0');
reg_69 <= (others => '0');
reg_70 <= (others => '0');
reg_71 <= (others => '0');
reg_72 <= (others => '0');
reg_73 <= (others => '0');
reg_74 <= (others => '0');
reg_75 <= (others => '0');
reg_76 <= (others => '0');
reg_77 <= (others => '0');
reg_78 <= (others => '0');
reg_79 <= (others => '0');
reg_80 <= (others => '0');
reg_81 <= (others => '0');
elsif (rising_edge(clk)) then
reg_1 <= input_reg_1;
reg_2 <= reg_1;
reg_3 <= reg_2;
reg_4 <= reg_3;
reg_5 <= reg_4;
reg_12 <= input_reg_12;
reg_46 <= reg_12;
reg_47 <= reg_46;
reg_48 <= reg_47;
reg_49 <= reg_48;
reg_50 <= reg_49;
reg_51 <= reg_50;
reg_14 <= input_reg_14;
reg_52 <= reg_14;
reg_53 <= reg_52;
reg_54 <= reg_53;
reg_55 <= reg_54;
reg_56 <= reg_55;
reg_57 <= reg_56;
reg_19 <= input_reg_19;
reg_58 <= reg_19;
reg_59 <= reg_58;
reg_60 <= reg_59;
reg_61 <= reg_60;
reg_62 <= reg_61;
reg_63 <= reg_62;
reg_25 <= input_reg_25;
reg_64 <= reg_25;
reg_65 <= reg_64;
reg_66 <= reg_65;
reg_67 <= reg_66;
reg_68 <= reg_67;
reg_69 <= reg_68;
reg_28 <= input_reg_28;
reg_70 <= reg_28;
reg_71 <= reg_70;
reg_72 <= reg_71;
reg_73 <= reg_72;
reg_74 <= reg_73;
reg_75 <= reg_74;
reg_39 <= input_reg_39;
reg_76 <= reg_39;
reg_77 <= reg_76;
reg_78 <= reg_77;
reg_79 <= reg_78;
reg_80 <= reg_79;
reg_81 <= reg_80;
reg_41 <= input_reg_41;
reg_42 <= reg_41;
reg_43 <= reg_42;
reg_44 <= reg_43;
reg_45 <= reg_44;
reg_6 <= input_reg_6;
reg_7 <= reg_6;
reg_8 <= reg_7;
reg_9 <= reg_8;
reg_10 <= reg_9;
reg_33 <= input_reg_33;
reg_34 <= reg_33;
reg_35 <= reg_34;
reg_36 <= reg_35;
reg_37 <= reg_36;
reg_20 <= input_reg_20;
reg_15 <= input_reg_15;
reg_26 <= input_reg_26;
reg_29 <= input_reg_29;
reg_22 <= input_reg_22;
reg_23 <= reg_22;
reg_24 <= reg_23;
end if;
end process reg_pr;



op_3 <= reg_19;
op_4 <= reg_25;
op_1 <= reg_20;
op_2 <= reg_15;
op_6 <= reg_26;
op_5 <= reg_29;

end architecture;

Returning to LCSG development, in the following approaches, an exemplary
preferred embodiment of the invention starts with CDFGs that represent the entire application
and that have been subjected to zone identification, parallelization and loop unrolling. The
zones / Control Points Embedded Zones (CPEZ) that can be suitable candidates for
reconfiguration will be tested for configurable components through the following approaches.
Note: Each Zone / CPEZ will be represented as a graph.

Proposed Approach
Seed selection: 

This approach is to find seed basic blocks and proceed on the CFG to grow these
seeds. Note that if a basic block has an outgoing edge whose destination basic block's
first instruction line number is less than or equal to the line number of the first instruction of
the source basic block, then that outgoing edge is a loop-back edge.

For example, in Fig. 7, if the edge leaving basic block Y leads to a basic block X whose
first instruction line number (as extracted from the *.ir.c file) is less than or equal to that of
basic block Y, then that edge is a loop-back edge (e_y,x); BBx will be the start of the loop and
BBy will be the seed. Since C and C++ are sequential languages, the Lance compiler does not
build loops in any other manner, so this test does not give erroneous results.
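The loop-back test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `find_loop_back_edges`, the block names and the line numbers are hypothetical, and the first-instruction line numbers are assumed to have been extracted already from the intermediate-representation file.

```python
def find_loop_back_edges(first_line, edges):
    """Return the CFG edges whose destination starts at or before the source.

    first_line: {basic block: line number of its first instruction}
    edges: list of (source block, destination block)
    """
    return [
        (src, dst)
        for src, dst in edges
        if first_line[dst] <= first_line[src]
    ]


# Hypothetical CFG: W -> X -> Y -> Z with Y looping back to X.
first_line = {"W": 10, "X": 20, "Y": 35, "Z": 50}
edges = [("W", "X"), ("X", "Y"), ("Y", "X"), ("Y", "Z")]
print(find_loop_back_edges(first_line, edges))  # [('Y', 'X')]
```

Because the comparison is "less than or equal," an edge from a block to itself is also reported, which corresponds to a loop consisting of a single basic block.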
In this approach, the seed is a basic block that lies inside a loop, because the loop is
executed over and over. This process can result in 3 types of loops:

(i) A single nested level loop with only 1 basic block, as shown in Fig. 8,

(ii) A single nested level loop with > 1 basic block, as shown in Figs. 9 (a) and (b); Z
is not considered a loop in Fig. 9 (a), and

(iii) A multi-level nested loop, as shown in Fig. 10.

To proceed further, only basic blocks of class X as in types
(ii) and (iii) are considered as seeds. This step is a simple construct to start off and yet allows
the growth of the constructs to include multiple level nested loops, without one growing
construct overlapping another growing construct/cluster.

The next step is to identify all basic blocks that come under the control umbrella of X
and Y. All such basic blocks lie between the linked list entries of V, i.e. G(E,V), of X and Y.
These blocks are classified into 3 categories, (i) Decision, (ii) Merge and (iii) Pass, as shown for
example in Fig. 11.

The same block might be included in both Decision and Merge classes. Therefore
the number of blocks in this umbrella under (a, j) <= (Decision + Merge + Pass). This feature
vector is one of the vectors used to quickly estimate the similarity of clusters.

Another feature vector will be the vector of operation type count for blocks in the
Decision, Merge and Pass classes.

Example:

Merge (c, e, j)    +    *    V    /
c =                5    3    2    1
e =                2    0    1    0
j =                0    3    0    0

Total = (7,6,3, ....,1)
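The operation-type feature vector of the example above is a column-wise sum over the blocks of a class. A minimal sketch follows; the function name `total_op_counts` and the dictionary layout are hypothetical, and the column labels are copied as they appear in the example.

```python
from collections import Counter


def total_op_counts(per_block_counts):
    """Sum operation-type counts over all blocks of one class.

    per_block_counts: {block: {operation type: count}}
    """
    total = Counter()
    for counts in per_block_counts.values():
        total.update(counts)  # update() keeps zero counts, unlike Counter '+'
    return total


# Merge-class blocks c, e and j from the example.
merge_blocks = {
    "c": {"+": 5, "*": 3, "V": 2, "/": 1},
    "e": {"+": 2, "*": 0, "V": 1, "/": 0},
    "j": {"+": 0, "*": 3, "V": 0, "/": 0},
}
total = total_op_counts(merge_blocks)
print([total[op] for op in ("+", "*", "V", "/")])  # [7, 6, 3, 1]
```

Two clusters whose vectors differ greatly can be set aside early, before any finer graph comparison is attempted.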

These steps should be used to form candidate clusters from the CFG that can be
classified as similar / reconfigurable. This result could vary based on the programmer's skill.
Code from highly skilled programmers could lead to faster grouping, because such programmers
tend to encompass repeated versions of a complex construct in a function and perform
repeated function calls.
Finer comparisons, for performing the extraction of the largest common sub-graph, are
carried out on this group.

Identifying the Largest Common Sub-graph or Common Set of Sub-graphs between
two candidate Data Flow Graphs, each representing a Basic Block.

Each edge in a DFG is represented by a pair of nodes (Source and
Destination). Each node represents an operation such as add (+), multiply (*), divide (/) etc.
All the edges are represented by a doubly linked list as part of the graph representation
G(V,E). These edges are now sorted into several bins based on the following criteria.

The criterion for sorting is based on the fact that an edge consists of two basic
elements (Source Operation, Destination Operation). In the example shown, source operation
'a' has a lower rank than 'b' and 'c'. If the SOs of two edges are the same, then their DOs are
compared. The same rule applies: the DO with the lower rank is placed to the left. In this
manner, the string is sorted. Say for example a sorted string is:

aa, aa, ac, ba, ba, bb, bc, cb, cc

Now these pairs of letters will be placed into bins. In order to place them,
the first or leftmost pair (aa in our example) is assumed to be the head of
the queue. It is placed in the first bin. Then all the following elements in the
queue are compared with the head, until a mismatch is obtained. If a match
occurs, then that pair is placed in the same bin as the head. Now the first
mismatched pair is designated as the new head of the queue. This is now placed in
a new bin and the process is followed until all elements are in a set of bins as
shown in Figure 12.
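The sorting and binning of edges just described can be sketched as follows. This is a minimal illustration: the function name `bin_edges` is hypothetical, each edge is written as a two-letter string (source operation, then destination operation), and alphabetical order is assumed to stand for the rank order of the operations.

```python
from itertools import groupby


def bin_edges(edge_types):
    """Sort edges by (source op, destination op) and group equal ones into bins."""
    ordered = sorted(edge_types)  # compares the SO first, then the DO
    # groupby starts a new bin at each mismatch with the current head.
    return [list(group) for _, group in groupby(ordered)]


# The edges of the example, in arbitrary order.
edges = ["ba", "aa", "cc", "ac", "bb", "aa", "cb", "ba", "bc"]
for bin_ in bin_edges(edges):
    print(bin_)
```

For the example string this yields seven bins, one each for aa, ac, ba, bb, bc, cb and cc, with the aa and ba bins holding two edges apiece.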
The next step is to perform a similar, but not exactly the same, process for the graph
that needs to be compared with the candidate graph, graph number 1. Consider a second
graph, graph number 2, as shown in Figure 13. (In Graph 2 flow is left to right rather than top
to bottom.)

This graph is converted to a string format in the same manner as graph number 1, and this
string, as shown below, needs to be placed into a new set of bins:

aa, ab, ab, ba, ba, bb, bb, bc, cb, cc

This is done by assigning the leftmost element in the queue to be the head. The head is
first compared to the element type in the first bin of the old set (aa), which is termed the
reference bin. If they are the same, the first bin of the new set is created and all elements
up to the first mismatch are placed in this bin. The reference bin is then marked as checked.
Next, the new head type is compared to the first unchecked bin of the reference set. If there
is a mismatch, the comparison is repeated with the next unchecked bin, and so on, until the SO
of the head's element type differs from the SO of the element type in the reference bin. At
this point, all successive element pairs in the current queue are compared with the head until
a mismatch is met, and the matched elements are eliminated.
If, however, a match is found between the head of the queue and a reference bin, a
new bin in the current set is created and suitably populated. The corresponding reference bin
is marked as checked, and all preceding unchecked bins of the reference set are eliminated.

By this approach, comparisons between unnecessary edges in the graphs are
eliminated. A new set of bins for graph 2 is thus obtained, as shown in Fig. 13(a).

Thus the edges in a Data Flow Graph, representing a Basic Block, are arranged into
bins as described above. Note that when it is said that a bin should be eliminated
if its corresponding type is not found in the previous pair, what is meant is that
the bin should be marked for elimination. One thus has a pair of bin
sequences, in which some bins might have been marked as 'eliminated' type.
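
The comparison of the second graph's queue against the reference bins can be approximated by a two-pointer walk over the two sorted sequences. The sketch below is a simplification rather than the exact head-of-queue procedure: it assumes that plain string order coincides with rank order (a < b < c), and the function name `match_bins` and its return values are hypothetical.

```python
def match_bins(reference_types, queue):
    """Walk two rank-sorted sequences in step.

    reference_types: distinct bin types of graph 1, in sorted order.
    queue: sorted edge string of graph 2.
    Returns (new_bins, eliminated_edges, eliminated_reference_types).
    """
    new_bins = {}
    eliminated_edges = []
    eliminated_refs = []
    r = 0
    for edge in queue:
        # Reference bins that sort before this edge can no longer match.
        while r < len(reference_types) and reference_types[r] < edge:
            if reference_types[r] not in new_bins:
                eliminated_refs.append(reference_types[r])
            r += 1
        if r < len(reference_types) and reference_types[r] == edge:
            new_bins.setdefault(edge, []).append(edge)  # matched type
        else:
            eliminated_edges.append(edge)  # no bin of this type in graph 1
    # Reference bins never reached by the queue are marked as eliminated too.
    for ref in reference_types[r:]:
        if ref not in new_bins:
            eliminated_refs.append(ref)
    return new_bins, eliminated_edges, eliminated_refs


reference = ["aa", "ac", "ba", "bb", "bc", "cb", "cc"]
graph2 = ["aa", "ab", "ab", "ba", "ba", "bb", "bb", "bc", "cb", "cc"]
new_bins, dropped_edges, dropped_refs = match_bins(reference, graph2)
```

For the two example strings, the `ab` edges of graph 2 have no counterpart in graph 1 and the `ac` bin of graph 1 has no counterpart in graph 2, so both are marked for elimination.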

Consider any such bin and track all edges connected to the edges in that bin. If any of
these connected edges are isolated edges (i.e., all of their connected edges, namely
predecessors, siblings, companions and successors, are marked as 'eliminated' type),
then mark them as 'eliminated' type as well. This is illustrated in Fig. 14.

Now, for all the remaining 'un-eliminated' edges, quadruple associativity
information is obtained (Predecessors, Siblings, Companions, and Successors). At this
point, measure the associativity counts for all edges in a bin pair.

For example, if there are 3 bins in each graph, say Add-Divide, Divide-Multiply and
Add-Multiply, then redistribute the edges in each bin of each graph into the corresponding
associativity columns. This results in the tables (called Associativity-Bin matrices) shown
below, where 'x' represents edges belonging to a particular associativity number in a bin.
[Associativity-Bin matrices for G1 and G2: rows are bin types (e.g., +*), columns are associativity counts.]
The following pseudo code in C describes the matching, i.e., the discovery of the largest
common sub-graph or sets of common sub-graphs between the two candidate DAGs using the
Associativity-Bin Matrices.
************************** pseudo code begin **************************

***************************** comment begin ***************************

Given 2 sorted Directed Acyclic Graphs G1 and G2 in matrix form such that

    height of both matrices = height, and
    width of graph 1 = width_G1
    width of graph 2 = width_G2

As an example,

[Example matrices for Graph1 and Graph2: rows (height) are bin types such as +/, columns
(width of Graph1, width of Graph2) are Associativity Counts, and x marks occupied cells.]

Here x marks those row, column intersections into which edges of the graph are
distributed; each x represents a Primary Group of Edges (PGE) or
Secondary Group of Edges (SGE).

****************************** comment end ****************************



main()
{
    initialize i = height;
    initialize k = width_G2;

    for (j = width_G2; j <= 1 OR G1(i,j) == Null; j--)
    {
label:
        for (i = height; i <= 1 OR G1(i,j) == Null; i--)
        {
            while (G2(i,k) == Null)
            {
                k++;
                if (k > width_G2)
                    exit and goto LOC_1;
            }

            /* function call */
            compare(G1(i,j).edges, G2(i,k).edges);
            reset value of k to width_G2;

LOC_1:
        }

        reset value of i to height;
    }
}



void compare(group_of_edges1, group_of_edges2)
{
    if (group_of_edges1.#of_edges > group_of_edges2.#of_edges)
    {
        group_of_edges1 is Primary_Group_of_Edges or PGE;
        group_of_edges2 is Secondary_Group_of_Edges or SGE;
    }
    else
        the other way around;



    ************************** comment begin **************************

    Assuming that a group of edges (PGE / SGE) is arranged in a data
    structure that looks like this:
    Here a, g, etc. are Nodes,
    and a-g, a-k, etc. are Edges.

    Edges of type div2mul

    Note that edges in each slot are divided into 2 baskets:
    1) uncovered basket
    2) covered basket

    Initially, when the graph comparison begins, all Associated Edges
    (Predecessors, Siblings, Companions, Successors) in all slots will be in
    the respective uncovered baskets.
    But as edges are covered, those Associated Edges will start filling
    their respective covered baskets.

    For reasons of simplicity, the above example assumes all the Associated
    Edges are in their respective uncovered baskets.

    *************************** comment end ***************************



    /* outer for loop */
    for (prow = 1; prow <= PGE.#of_edges; prow++)
    {
        /* inner for loop */
        for (srow = 1; srow <= SGE.#of_edges; srow++)
        {
            /* function call */
            Result = Test_for_compatibility(PGE(prow), SGE(srow));

            if (Result == fail)
            {
                prow--;
            }
            else /* if Result == pass */
            {
                /* function call */
                cover(PGE(prow), SGE(srow));

                exit(1); /* this should exit the inner for loop and
                            continue with the outer for loop */
            }
        }
        /* inner for loop */
    }
    /* outer for loop */
    return();
}



int Test_for_compatibility(PGE(prow), SGE(srow))
{
    if (PGE(prow).candidate_edge.covered_flag ==
        SGE(srow).candidate_edge.covered_flag)
    {
        if (PGE(prow).candidate_edge.Source_node.touched_flag ==
            SGE(srow).candidate_edge.Source_node.touched_flag)
        {
            if (PGE(prow).candidate_edge.Destination_node.touched_flag ==
                SGE(srow).candidate_edge.Destination_node.touched_flag)
            {
                if (PGE(prow).covered_count == SGE(srow).covered_count)
                {
                    for (column = 1; column <= 4; column++)
                    {
                        for (slot = 1; slot <= 3 AND
                             PGE(prow,column,slot) != null AND
                             SGE(srow,column,slot) != null; slot++)
                        {
                            if (PGE(prow,column,slot).covered_count ==
                                SGE(srow,column,slot).covered_count)
                            {
                                return pass;
                                /* this indicates a potential for
                                   covering to be performed */
                            }
                            else
                                return fail;
                        }
                    }
                }
                else
                    return fail;
            }
            else
                return fail;
        }
        else
            return fail;
    }
    else
        return fail;
}

void cover(PGE(prow), SGE(srow))
{
    if (PGE(prow).candidate_edge.covered_flag != 1)
    {
        PGE(prow).candidate_edge.covered_flag = 1;
        SGE(srow).candidate_edge.covered_flag = 1;
        update_flags_and_counts(PGE(prow).candidate_edge,
                                SGE(srow).candidate_edge);
    }

    for (column = 1; column <= 4; column++)
    {
        for (slot = 1; slot <= 3 AND PGE(prow,column,slot) != null AND
             SGE(srow,column,slot) != null AND
             PGE(prow,column,slot).uncovered_count != null AND
             SGE(srow,column,slot).uncovered_count != null; slot++)
        {
            /* outer for loop */
            for (pedge = 1; pedge <= PGE(prow,column,slot).uncovered_count; pedge++)
            {
                /* inner for loop */
                for (sedge = 1; sedge <= SGE(srow,column,slot).uncovered_count; sedge++)
                {
                    if (PGE(prow,column,slot,uncovered_basket[pedge]).Source_node.touched_flag ==
                        SGE(srow,column,slot,uncovered_basket[sedge]).Source_node.touched_flag
                        AND
                        PGE(prow,column,slot,uncovered_basket[pedge]).Destination_node.touched_flag ==
                        SGE(srow,column,slot,uncovered_basket[sedge]).Destination_node.touched_flag)
                    {
                        push_this_edge_into_covered_basket
                            (PGE(prow,column,slot,uncovered_basket[pedge]),
                             SGE(srow,column,slot,uncovered_basket[sedge]));

                        update_flags_and_counts
                            (PGE(prow,column,slot,uncovered_basket[pedge]),
                             SGE(srow,column,slot,uncovered_basket[sedge]));
                        exit(1);
                        /* this should exit the inner for loop
                           and continue with the outer for loop */
                    }
                }
                /* inner for loop */
            }
            /* outer for loop */
        }
    }
    return();
}

void push_this_edge_into_covered_basket(pedge, sedge)
{
    /* this does a transfer of the covered edge from the uncovered basket
       of a slot to the covered basket of a slot */
}

void update_flags_and_counts(edge_from_PGE, edge_from_SGE)
{
    /* this does an update on all covered flags of edges
       and on all touched flags of nodes
       and on covered and uncovered counts of all slots
       and the total count for candidate edges
    */
}

*************************** pseudo code end ***************************

The complexity of this algorithm is estimated to be of the order O(N^5), where N
represents the number of edges in the smaller of the 2 candidate graphs.

Although this complexity is high compared to the O(P*N^4) complexity of the
algorithm proposed by Cicirello at Drexel University, the differences are:

a. Cicirello's algorithm delivers a large enough common sub-graph, which is
an approximate result.

b. The proposed algorithm not only derives the largest common sub-graph or
a large common sub-graph but also potentially derives other common
sub-graphs. All such common sub-graphs result in potential savings when
implemented as an ASIC computation unit.

c. Cicirello's algorithm relies on a random number of attempts (P) to start
the initial mapping. In the worst case, if all possible mappings are tried,
the solution becomes exponential.

Therefore, after subjecting the CFG to the above set of processes, 2 types of entities
are obtained: (i) Basic Blocks with large common sub-graphs and (ii) Basic Blocks without
any common sub-graphs. For the purpose of scheduling, Basic Blocks that share common
sub-graphs will be termed 'Processes', or nodes in the CFGs that share resources.

As an example, 2 DAGs (affine and perspective preloop) were analyzed for common
sub-graphs. The common sub-graph obtained is shown in Fig. 5.

Architectures of Common Sub-graphs: 

For a common sub-graph, an ASAP schedule is performed. Although many other
types of scheduling are possible, in this research effort the focus is placed primarily on
extracting maximal parallelism and hence speed of execution. The earliest start times of
individual nodes are determined by the constraint imposed by the ASAP schedule of the
parent graph in which the common sub-graph is embedded, or from which it is extracted.
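
An ASAP schedule with start-time constraints inherited from a parent graph can be sketched as below. This is a generic as-soon-as-possible pass over a DAG in topological order, not the specific implementation used in this work; the node names, delays and the `earliest` parameter (lower bounds imposed by the parent graph) are hypothetical.

```python
from collections import defaultdict, deque


def asap_schedule(nodes, edges, delay, earliest=None):
    """Return the earliest start time of every node in a DAG.

    nodes: list of node names; edges: list of (producer, consumer) pairs;
    delay: execution time of each node; earliest: optional lower bounds on
    start times taken from the parent graph's schedule.
    """
    earliest = earliest or {}
    successors = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    start = {n: earliest.get(n, 0) for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    while ready:
        u = ready.popleft()
        for v in successors[u]:
            # A node can start only after all of its producers have finished.
            start[v] = max(start[v], start[u] + delay[u])
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return start


nodes = ["m1", "m2", "a1", "m3"]
edges = [("m1", "a1"), ("m2", "a1"), ("a1", "m3")]
delay = {"m1": 2, "m2": 2, "a1": 1, "m3": 2}

standalone = asap_schedule(nodes, edges, delay)
embedded = asap_schedule(nodes, edges, delay, earliest={"m2": 1})
```

The two calls illustrate why the same sub-graph can have different schedules: delaying `m2` by one step in the parent graph shifts every dependent node in `embedded` by one step relative to `standalone`.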
Since the schedule depends on the parent graph, the same sub-graph has different
schedules depending on the parent graph (affine transform preloop DAG / perspective
transform preloop DAG). In order to derive a single architecture that can be used with minimal
changes in both instantiations of the common sub-graph, the sharing of resources is performed
based on the instance that requires the larger number of resources. This policy is applied to
each resource type individually. For example, the sharing of multiplier nodes in instance 1
(affine) can be formed as:

e | j,b,c | v,g,h | A1,A5,A6 | A3,A7,A8 | y,k,l | n,o,p | r
and the sharing of multiplier nodes in instance 2 (perspective) can be formed as:
e | b,c | v,g,h | A1,A5,A6 | A3,A7,A8 | y,k,l | o,p | r | j | n |
Since instance 2 requires a greater number of resources, the resource sharing in
instance 1 is modified to match that of instance 2.

The same process is followed for the adder nodes, and a common sharing is obtained:
A2, f, d | u, t, i | A4, s, q | x, w, m |

Implementing an architecture for each instance with the common resource sharing
distribution results in 2 similar architectures (shown in the figures below), which differ
in the number of delays present on certain paths.

This problem is overcome by adding multiplexers along paths that have different delays
while connecting the same source and destination(s). This is shown in the figure below.

In this research effort, the common architectures are implemented as ASICs in
VHDL. The regions of the DAGs that are not covered by common architectures are left for
generic LUT-style implementation. For the above example of complex warping applications,
the common architectures were synthesized and gate counts were obtained based on Xilinx's
estimates using the Xilinx Synthesis Tool. This architecture was further translated onto
LUTs on a Xilinx Spartan 2E FPGA. Based on well-accepted procedures, gate count and bit
stream estimates for the translated architecture have been obtained [refer Trenz Electronic
paper]. These results show the potential savings that can be achieved in 2 modes of
implementation: (i) a completely LUT-based architecture with flexible partial
reconfigurability and (ii) an ASIC-LUT based architecture. In type (i) the savings are
expressed in terms of the time taken to perform the redundant reconfiguration (assuming that
the configuration is performed at the peak possible level of 8 bits in parallel at 50 MHz), over
one run / execution of the preloop basic block and over an expected run of 30 iterations per
second (since there are 30 frames per second of video, and the preloop basic block is
executed for every frame). In type (ii) the savings are expressed in terms of the number of
gates required to represent the architecture in an ASIC versus the number of gates required to
represent the architecture in the LUT format of the Spartan 2E FPGA.

In both types, significant savings are obtained.
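
The time saving of type (i) follows from simple arithmetic on the configuration interface. The sketch below assumes the stated peak rate of 8 bits in parallel at 50 MHz and 30 executions per second; the bitstream size of 400,000 bits is a purely hypothetical figure chosen for illustration and is not a result from this work.

```python
def reconfig_time_s(bitstream_bits, bus_width_bits=8, clock_hz=50e6):
    """Time to load a bitstream at the peak configuration rate, in seconds."""
    return bitstream_bits / (bus_width_bits * clock_hz)


# Hypothetical size of the redundant portion of the bitstream, in bits.
redundant_bits = 400_000

per_run_s = reconfig_time_s(redundant_bits)  # saved per preloop execution
per_second_s = 30 * per_run_s                # saved per second of 30 fps video
```

With these assumed numbers, avoiding the redundant reconfiguration saves about 1 ms per execution of the preloop basic block, or about 30 ms for every second of video.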
Scheduling 

Once the number of processing units has been chosen, the CDFGs have to be mapped
onto these units. This involves scheduling, i.e., allocating tasks to the processing units in
order to complete execution of all possible paths in the graphs with the least wastage of
resources while avoiding conflicts due to data and resource dependencies.

In the graph matching, one can include branch operations to reduce the number of
graphs. This can be done if one of the paths of a branch operation leads to a very large graph
compared to the other path, or is a subset of the other path. This still leaves the
problem of conditional task scheduling with loops involved. Since scheduling is applicable to
many diverse areas of research, not all the work done in scheduling is discussed in this
section. Instead, the section focuses on work relevant to mapping data flow graphs onto
processors, proposes a method most suitable for the purpose of reconfiguration, and compares
it with contemporary methods. Several researchers have addressed task scheduling, and one
group has also addressed loop scheduling with conditional tasks [57]. A detailed survey of
data and control dominated scheduling approaches can be found in [58], [59] and [60]. Jha
[57] addresses scheduling of loops with conditional paths inside them. This is a good
approach as it exploits parallelism to a large extent and uses loop unrolling. The drawback
is that the control mechanism for keeping track of 'which iteration's data is being
processed by which resource' is very complicated. The approach is useful for one or two levels
of loop unrolling, and where the processing units can afford to communicate quite often
with each other and the scheduler. In the present case, the network occupies about 70% of the
chip area [1], and hence the processing units cannot afford to communicate with each other too
often. Moreover, the granularity level of operation between processing elements is beyond a
basic block level, and hence this method is not practical. Within a processing element, since
the reconfiguration distance (edit distance) is more important, fine scale scheduling is
compromised, because the benefits of using very fine grain processing units are lost due
to high configuration load time. The paper [68] discusses a 'path based edge activation' scheme.
This basically means that if, for a group of nodes (which must be scheduled onto the same
processing unit and whose schedules are affected by branch paths occurring at a later stage),
one knows ahead of time the branch controlling values, then one can prepare at run time all
possible optimized list schedules for every possible set of branch controller values. In the
simple example shown in Fig. 15, the nodes in gray need to be scheduled on the
same processing unit. The branch controlling variable is b, which can take values of 0 or 1. If
it takes a 0, the branch path in red is taken; otherwise the path in green is taken. In the case
where one can know the value of b at run time, yet ahead of the occurrence of the branch
paths, one can prepare schedules for the 3 gray nodes and launch either one the
moment b's value is known.

This method is very similar to the partial critical path based method proposed by [69].
It involves the use of a hardware scheduler and is quite well suited for the present
application. But one needs to add another constraint to the scheduling: the amount of
reconfiguration, or the edit distance. In [69] the authors tackle control task scheduling in
2 ways. The first is partial critical path based scheduling, which is discussed above, although
they do not assume that the value of the conditional controller is known prior to the
evaluation of the branch operation. They also propose the use of a branch and bound technique
for finding a schedule for every possible branch outcome. This is quite exhaustive, but it
provides an optimal schedule. Once all possible schedules have been obtained, the schedules
are merged. The advantage is that the result is optimal, but the method has the drawback of
being quite complex. It also does not consider loop structures. Other papers that discuss
scheduling onto multiprocessor systems include [70], [71] and [72]. Other works on static
scheduling ([73] and [74]) involve linearization of the data flow graphs. Some others have
also taken fuzzy approaches [75] and [76].
Proposed approach 

Given a control-data flow graph, one needs to arrive at an optimal schedule for the
entire device. A method is provided to obtain near optimal schedules. This involves a brief
discussion of the PCP scheduling strategy, followed by an enhancement to the current
approach to arrive at a schedule closer to optimal. In addition, the scheduling includes
reconfiguration time as additional edges in the CDFG. Ways to handle loops embedded with
mutually exclusive paths and loops with unknown execution cycles are dealt with as well.

A directed cyclic graph developed by the Lance compiler 101 from source code has
been used to model the entire application. It is a polar graph with both source and sink nodes.
The graph can be denoted by G(V,E). V is the list of all processes that need to be scheduled.
E is the list of all possible interactions between the processes. The processes can be of three
types: data, communication and reconfiguration. The edges can be of three types:
unconditional, conditional and reconfiguration. A simple example with no reconfiguration
and no loops is shown in Fig. 13X.

In the graph of Fig. 13X, each of the circles represents a process. Sufficient resources
are assumed for communication purposes. All the processes have an execution time
associated with them, which is shown alongside each circle. If any process is a
control-based process, then the various values to which the condition evaluates are shown on
the edges emanating from that process circle (e.g., P11 evaluates to D or D'). The method
may be summarized as follows:

i. Use known Partial Critical Path (PCP) scheduling to determine the delays for each
possible path of the CDFG, and arrange the list of paths in descending order of the
delays.

ii. Perform branch and bound based scheduling (which, to reduce the complexity, need
not be done for every path).

iii. Once the final list of all schedules is ready, merge all the schedules while respecting
data and resource dependencies.

This example demonstrates the initialization strategy. It describes how the CDFG is split into
individual DFGs. Moreover, it also shows the various fields required for each node and edge.
For the CDFG of Fig. 13X, initialization of the CDFG data structure and branching tree
proceeds as follows:

Var_indices: var[0] = D; var[1] = C; var[2] = K;
Assume number of processing elements of type = 1

Branching tree paths: DCK, DCK', DC'K, DC'K', D'CK, D'CK', D'C'K, D'C'K'
Branching tree paths not possible: D'CK, D'CK', D'C'K, D'C'K'
Removing K we get: D'C, D'C'

Final Branching tree paths: DCK, DCK', DC'K, DC'K', D'C, D'C'.
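
The derivation of the final branching tree paths can be illustrated with a short enumeration. The sketch assumes, consistently with the edge list in which K is evaluated by a process reached from only one outcome of D, that K is evaluated only when D takes its unprimed value; this guard, the prime notation for the complemented value, and the name `branching_paths` are assumptions made for illustration.

```python
from itertools import product


def branching_paths(variables, guard):
    """Enumerate the distinct control paths of a branching tree.

    variables: control variables in their order of evaluation.
    guard: maps a variable to the (variable, value) pair that must hold
           for it to be evaluated at all on a given path.
    """
    paths = set()
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        path = ""
        for var in variables:
            need = guard.get(var)
            if need is not None and assignment[need[0]] != need[1]:
                continue  # this variable is never evaluated on this path
            path += var if assignment[var] else var + "'"
        paths.add(path)  # paths that differ only in an unused variable merge
    return sorted(paths)


# Assumption: K is evaluated only on the branch where D is true.
paths = branching_paths(["D", "C", "K"], {"K": ("D", True)})
```

Of the 8 raw combinations, the 4 that pair D' with a value of K collapse into the 2 partial paths D'C and D'C', leaving 6 final paths.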

Tables XX and YY are the node and edge lists, respectively, for the CDFG of Fig.
13X. Figs. 14X - 19X are the individual Data Flow Graphs (DFGs) of the CDFG of Fig.
13X.

Table XX: 

Node list for the CDFG 
Table YY:

Edge list for the CDFG:

Edge_index   parent_node_id   child_node_id   is_control   variable_index
1            1                2               0
2            1                3               0
3            2                4               1            1
4            2                5               1            1
5            2                6               1            1
6            3                6               0
7            4                5               0
8            4                7               0
9            6                8               0
10           6                9               0
11           7                10              0
12           8                10              0
13           9                10              0
14           11               12              1            0
15           11               13              1            0
16           3                14              0
17           12               14              1            2
18           12               15              1            2
19           12               16              0
20           13               17              0
21           14               17              0
22           15               17              0
23           16               17              0

PCP scheduling is a modified list-based scheduling algorithm. The basic concept of the
Partial Critical Path based scheduling algorithm is as follows. As shown in Fig. 20X, suppose
processes P_A, P_B, P_X and P_Y are all to be mapped onto the same resource, say Processor
Type 1. P_A and P_B are in the ready list, and a decision needs to be taken as to which will be
scheduled first. λ_A and λ_B are the times of execution for processes in the paths of P_A and
P_B respectively, but which are not allocated on the Processors of type 1 and also do not
share the same type of resource.

If P_A is assigned first, then the longest time of execution is given by

    Max(T_A + λ_A, T_A + T_B + λ_B).

If P_B is assigned first, then the longest time of execution is given by

    Max(T_B + λ_B, T_B + T_A + λ_A).

The best schedule is the minimum of the two quantities. This is called the partial
critical path method because it focuses on the path time of the processes beyond those in the
ready list. Therefore, if λ_A is larger than λ_B, a better schedule is obtained if Process A is
scheduled first. But this does not consider the possibility of resource sharing between the
processes in the paths beyond those in the ready list. A simple example (Fig. 21X) shows that
if T_A = 3, T_B = 2, λ_A = 7, λ_B = 5, where the processes in the λ_A and λ_B sections share the
same resource, say Processor type 2, then scheduling Process A first gives a time of 15 and
scheduling B first gives a time of 14. But both the critical path method and PCP as proposed
by Pop suggest scheduling A first.

The difference arises because, if the resource constraint of the post-ready-list processes is
considered, the best schedule is determined by the minimum of 2 max quantities:

    Max(T_B, λ_A) and Max(T_A, λ_B).

With A first, the total is T_A + Max(T_B, λ_A) + λ_B = 3 + 7 + 5 = 15; with B first, it is
T_B + Max(T_A, λ_B) + λ_A = 2 + 5 + 7 = 14.
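
The numbers in this example can be checked with a small calculation. The sketch below contrasts the conventional PCP estimate, which ignores contention between the tail sections, with an estimate in which both tails must share one resource; the function names are hypothetical and the model is restricted to the two-process situation of the example.

```python
def pcp_estimate(t_first, t_second, lam_first, lam_second):
    """Conventional PCP estimate: the two tails are assumed not to compete."""
    return max(t_first + lam_first, t_first + t_second + lam_second)


def shared_tail_estimate(t_first, t_second, lam_first, lam_second):
    """Estimate when both tails run on the same resource.

    The second tail can start only when its own process has finished and
    the first tail has released the shared resource.
    """
    first_tail_end = t_first + lam_first
    second_tail_ready = t_first + t_second
    return max(first_tail_end, second_tail_ready) + lam_second


T_A, T_B, LAM_A, LAM_B = 3, 2, 7, 5

pcp_a_first = pcp_estimate(T_A, T_B, LAM_A, LAM_B)
pcp_b_first = pcp_estimate(T_B, T_A, LAM_B, LAM_A)
shared_a_first = shared_tail_estimate(T_A, T_B, LAM_A, LAM_B)
shared_b_first = shared_tail_estimate(T_B, T_A, LAM_B, LAM_A)
```

The conventional estimate favors scheduling A first (10 versus 12), whereas the resource-aware estimate reproduces the figures in the text (15 for A first, 14 for B first) and therefore favors B.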

Pop [69] uses the heuristic obtained from PCP scheduling to bound the schedules in a
typical branch and bound algorithm to get to the optimal schedule. But the branch and bound
algorithm has exponential complexity in the worst case. So there is a need for a
less complex algorithm that can produce near-optimal schedules. From a higher viewpoint of
scheduling, one needs to limit the need for branch and bound scheduling as much as possible.

Initially, the control variables in the CDFG are extracted. Let c1, c2, ..., cn be the
control variables. Then there will be at most 2^n possible data-flow paths of execution, one
for each combination of these control variables, in the given CDFG. An ideal aim is to get the
optimal schedule at compile time for each of these paths. Since the control information is not
available at compile time, one needs to arrive at an optimal solution for each path with every
other path in mind. This optimal schedule is arrived at in two stages. First, the optimal
individual schedule for each path is determined. Then each of these optimal schedules is
modified with the help of the other schedules.
Stage 1: There are m = 2^n possible Data Flow Graphs (DFGs). For each DFG, PCP
scheduling is done. Then the DFGs are ordered in decreasing order of their total delays.
An optimal solution can be obtained by doing branch and bound scheduling for each of these
PCP scheduled DFGs. But branch and bound is a highly complex algorithm with exponential
complexity. In this case, this complex operation would need to be done 2^n times, where n is
the number of control variables, which makes the overall complexity unmanageable. Hence
branch and bound is done only when it is essential to do so. First, branch and bound
scheduling is done for DFG1, which has the largest delay. For DFG2, the PCP delay is
compared with the branch and bound delay of DFG1. If the PCP delay is smaller, then the
PCP schedule is taken as the optimal schedule for that path. If not, branch and
bound scheduling is done to get the optimal schedule. It is reasonable to do this, as the final
delay of each DFG after modification is going to be close to the delay of the worst delay path.
In the same way, the optimal schedule is arrived at for each of the DFGs.

Stage 2: Once the optimal schedule is arrived at, a schedule table is initialized with the
processes on the rows and the various combinations of control variables on the columns. A
branching tree is also generated, which shows the various control paths. This contains only
the control information of the CDFG. There exists a column in the schedule table
corresponding to each path in this branching tree. The branching tree is shown in Fig. 20X.
The path corresponding to the maximum delay is taken, and the schedule for that
path is taken as the template (DCK'). Now the DCK path is taken and its
schedule is modified according to that of DCK'. This is done for all the paths. The final
schedule table obtained will be the table that resides on the processor.
The pseudo code of this process is summarized here.
Algorithm:

Task_schedule(G(V,E), CTRL_VARS[N], PE = {PE1, PE2, ..., PEM})

For each combination of CTRL_VARS do
{
    Generate a DFG Gsub(V,E,CTRL_VARS[I]), which is a sub-graph of G(V,E). Only the
    nodes and edges in the control flow corresponding to the current combination of
    CTRL_VARS are included in this sub-graph.

    Generate the PCP schedule of Gsub[I]. Let the schedule be PCP_sched[I] and the delay be
    PCP_delay[I].
}

Sort PCP_sched, PCP_delay and Gsub in decreasing order of PCP_delay[I].

Generate the branch and bound schedule for Gsub[0], the sub-graph with the worst
PCP_delay. Let the schedule be BB_sched[I=0] and the delay be BB_delay[I=0].
Initialize worst_bb_delay = BB_delay[0]

For all the other sub-graphs do
{
    if (PCP_delay[I] < worst_bb_delay) then
        BB_sched[I] = PCP_sched[I];
        BB_delay[I] = PCP_delay[I];
    else
        Generate BB_sched[I] and BB_delay[I];
        if (BB_delay[I] > worst_bb_delay) then
            worst_bb_delay = BB_delay[I];
}

Generate the branching tree with the help of G(V,E). In this branching tree, an edge
represents a choice (K and K') and a node represents a variable (K).
Initialize the current path to the one leading from the top to the leaf in such a way that the
DFG corresponding to this path gives the worst_bb_delay. The path is nothing but a list
of edges tracing from the top node to the leaf.
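
The selective use of branch and bound in the first part of the algorithm can be sketched as follows. The code works on delays only and treats the branch and bound scheduler as an opaque callable; `select_schedules`, the `branch_and_bound` parameter and the delay values are hypothetical stand-ins, since the schedulers themselves are not reproduced here.

```python
def select_schedules(pcp_delay, branch_and_bound):
    """Decide, per DFG, whether the PCP schedule is kept or refined.

    pcp_delay: dict mapping a DFG name to its PCP schedule delay.
    branch_and_bound: callable returning the branch and bound delay of a DFG
                      (a stand-in for the real scheduler).
    Returns (final_delay, refined), where refined lists the DFGs for which
    branch and bound was actually run.
    """
    order = sorted(pcp_delay, key=pcp_delay.get, reverse=True)
    worst = order[0]
    final_delay = {worst: branch_and_bound(worst)}
    refined = [worst]
    worst_bb_delay = final_delay[worst]

    for dfg in order[1:]:
        if pcp_delay[dfg] < worst_bb_delay:
            final_delay[dfg] = pcp_delay[dfg]  # PCP schedule is good enough
        else:
            final_delay[dfg] = branch_and_bound(dfg)
            refined.append(dfg)
            worst_bb_delay = max(worst_bb_delay, final_delay[dfg])
    return final_delay, refined


# Hypothetical delays for three DFGs.
pcp = {"DFG1": 20, "DFG2": 19, "DFG3": 12}
bb_table = {"DFG1": 17, "DFG2": 16, "DFG3": 11}

final_delay, refined = select_schedules(pcp, bb_table.get)
```

In this made-up case branch and bound runs for DFG1 and DFG2 only; the PCP delay of DFG3 is already below the worst branch and bound delay, so its PCP schedule is kept.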

Processes with large execution times have a greater impact on the schedule than the
shorter processes. Hence, large processes are scheduled in a special way. The shorter
processes can be scheduled using the PCP scheduling algorithm. Since PCP scheduling is
done for most of the processes, the complexity stays close to O(N), where N is the number
of processes to be scheduled.

a) Identify the first set of computationally complex processes that need to be scheduled
onto the same processor. Call them MP1, MP2, ... (Macro Process 1, etc.).

b) Schedule all the processes up to these macro processes in the data flow graph using PCP
scheduling.

c) Calculate the estimated execution time of the smaller processes to find the start time
of each of the macro processes.

d) Determine the next set of such macro processes in the DFG. Call them MP_sub1,
MP_sub2, ...

e) For processes between these two sets of macro processes, PCP scheduling is used.

f) For processes occurring after the second set of macro processes, the execution times
are added up to get the total execution time.

g) Now, determine the order of execution of these processes by estimating the worst-case
execution time in each case and selecting the best amongst them.

h) After this scheduling, the block after the second set of macro processes is taken as the
current DFG and steps a-g are repeated.

i) Step h is repeated until the end of the DFG is reached.
Schedule merging:

In the schedule table there are some columns representing paths that are complete and
some that are not. The incomplete paths can now be referred to as parent paths of possible
complete paths.

In the example shown in Fig. 13X, for earliest evaluation of all conditional variables
(viz. D, C, K) it is necessary to evaluate D first, then C and then K. Therefore the tree of
possible paths is as shown in Fig. 22X. While creating the schedule table, only the full
possible paths, i.e., the 6 paths listed in Fig. 22X, are considered initially. Scheduling is
performed by the suggested algorithm. This fills these columns. Then the remaining
columns of partial paths (i.e., D, DC, etc.) are created. These are initially empty columns.
Now, if a process has the same start time in multiple columns, it is pushed into the parent
empty column.
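
Pushing processes with identical start times into a parent column can be sketched as below. The schedule table is modeled as a dictionary from a path (a tuple of branch outcomes) to a mapping of process to start time; this representation, the function name `push_to_parent` and the sample start times are hypothetical.

```python
def push_to_parent(table, parent):
    """Move entries shared by all child columns of `parent` into `parent`.

    table: dict mapping a path tuple, e.g. ("D", "C", "K"), to a dict of
           {process: start_time}.
    parent: path tuple that is one level shorter than its child columns.
    """
    children = [
        path for path in table
        if len(path) == len(parent) + 1 and path[:len(parent)] == parent
    ]
    if len(children) < 2:
        return table  # nothing to merge

    # (process, start_time) pairs that occur in every child column.
    common = set.intersection(*(set(table[c].items()) for c in children))
    table.setdefault(parent, {}).update(dict(common))
    for child in children:
        for process, _ in common:
            del table[child][process]
    return table


table = {
    ("D", "C", "K"): {"P1": 0, "P2": 3, "P7": 9},
    ("D", "C", "K'"): {"P1": 0, "P2": 3, "P8": 9},
}
push_to_parent(table, ("D", "C"))
```

After the call, P1 and P2 sit in the parent column DC, since their start times do not depend on K, while P7 and P8 remain in their own full-path columns.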

For example, from Figure 4 of Pop's paper "Scheduling of Conditional Process
Graphs for the Synthesis of Embedded Systems" one sees that processes P1, P2, P6, P9, P10,
P11, Pe and so on have the same times of occurrence in both paths. Therefore one can push
them into the parent column of DC, because it means that these processes can be scheduled
for execution (not necessarily executed) by the logic schedule manager after C has been
evaluated.

This approach tries to obtain the worst case delay and merge all paths to that timeline.
Since the DCK path had the worst case optimal delay, all other full paths were adjusted to
match this path. But it is also necessary to consider the probability of occurrence of all the
full paths (6 of them). Then, preferably, the bottom 10% of the paths are pruned out. That is,
one disregards those full paths whose probability of occurrence is less than a threshold value
when compared to the path with the most probable occurrence.

Then a path is selected from the remaining ones, whose probability of occurrence is 

5 the highest. This will be the new reference to which all the remaining paths will adjust. Now 
it is likely that these chosen full paths and the disregarded full paths, share certain partial 
paths (parent paths). Therefore, while allocating the start times for the processes that fall 
under these shared partial paths, one must allocate them based on the worst (most delay 
consuming) disregarded path which needs (shares) these processes. While performing 

1 0 schedule merging, all data dependencies must be respected. 
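
The pruning and reference-selection step can be sketched as follows. This is a minimal illustration under one stated assumption: a path is disregarded when its probability is below `threshold` times the probability of the most probable path. The function name, the path names and the probabilities are all hypothetical.

```python
def prune_and_pick_reference(path_probs, threshold=0.1):
    """Drop unlikely full paths and pick the most probable one as reference.

    path_probs maps a full-path name to its probability of occurrence.
    A path is kept if its probability is at least `threshold` times the
    probability of the most probable path (an assumed reading of the rule).
    """
    most_likely = max(path_probs.values())
    kept = {p: pr for p, pr in path_probs.items()
            if pr >= threshold * most_likely}
    reference = max(kept, key=kept.get)
    dropped = sorted(set(path_probs) - set(kept))
    return reference, kept, dropped


# Hypothetical occurrence probabilities for six full paths.
probs = {"path1": 0.40, "path2": 0.25, "path3": 0.20,
         "path4": 0.10, "path5": 0.03, "path6": 0.02}
reference, kept, dropped = prune_and_pick_reference(probs)
print(reference, dropped)
```

The disregarded paths are returned rather than discarded, since shared parent paths must still be timed against the worst disregarded path that uses them.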

Example: Modified PCP for the DFG[1] corresponding to the branching tree path 
DCK' 

This shows how the modified PCP approach of this invention out-performs the 
conventional PCP algorithm. The decision taken at each schedule step is illustrated. 

Current time = 1 
Ready list: 1, 11 
Schedule 1 -> PE2 (next schedule time = 4), 11 -> PE3 (next schedule time = 8) 

Current time = 4 
Ready list: 2, 3 
There is a conflict; one needs to determine the next possible conflict between the 
remaining tasks dependent on 2, 3. 

Possible conflicts on the conflict table: 

Node index    List of possible conflicts    Processing Element 
7             [9]                           1 
9             [7]                           1 
10            []                            1 
5             [17]                          2 
17            [5]                           2 
6             []                            3 
8             []                            3 

Case 1: 7, 9 
Case 2: 5, 17 

Table: Conflict Table 

ASAP and ALAP times are used to determine the amount of conflict for each case. For this 
example, Case 1 has more conflict. Hence, consider Case 1. 

Now, possible orders of execution: [2,3,7,9], [2,3,9,7], [3,2,7,9], [3,2,9,7]. 
Determine the worst-case execution time for each of these orders and select the order with 
the minimum worst-case execution time. 

Worst-case execution times: 
[2,3,7,9] -> 34 
[2,3,9,7] -> 36 
[3,2,7,9] -> 38 
[3,2,9,7] -> 32 < 

Hence, the best execution order is [3,2,9,7]. 
Schedule 3 -> PE1 (next schedule time, nst = 8) 

Current time = 8 (min(next schedule times not yet used as current time)) 
Ready list: 12, 2, 14, 6 
Schedule 14 -> PEx (nst = 10), 2 -> PE1 (nst = 13) 
There now is a conflict between 6 and 12. 
There are no conflicts between the remaining tasks dependent on 6, 12. Therefore the only 
possible orders of execution are: 6,12 and 12,6. 

Worst-case execution times: 
[6,12] -> 22 
[12,6] -> 25 

Therefore, [6,12] is a better choice. 
Schedule 6 -> PE3 (nst = 16) 

Current time = 13 
Ready list: 5 
Schedule 5 -> PE2 (nst = 23) 

Current time = 16 
Ready list: 12, 8, 9 
Schedule 9 -> PE1 (nst = 22) 
There is now a conflict between 8 and 12. 
There are no conflicts between the remaining tasks dependent on 8, 12. Therefore the only 
possible orders of execution are: 8,12 and 12,8. 

Worst-case execution times: 
[8,12] -> 18 
[12,8] -> 15 

Therefore, [12,8] is a better choice. 
Schedule 12 -> PE3 (nst = 22) 

Current time = 22 
Ready list: 16, 8 
There is now a conflict between 8 and 16. 
There are no conflicts between the remaining tasks dependent on 8, 16. Therefore the only 
possible orders of execution are: 8,16 and 16,8. 

Worst-case execution times: 
[8,16] -> 10 
[16,8] -> 13 

Therefore, [8,16] is a better choice. 
Schedule 8 -> PE3 (nst = 26) 

Current time = 23 
Ready list: 15, 7 
Schedule 15 -> PE2 (nst = 28), 7 -> PE1 (nst = 31) 

Current time = 26 
Ready list: 16 
Schedule 16 -> PE3 (nst = 30) 

Current time = 30 
Ready list: 17 
Schedule 17 -> PE2 (nst = 32) 

Current time = 31 
Ready list: 10 
Schedule 10 -> PE1 (nst = 36) 
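
The conflict-resolution step used repeatedly in the trace above (enumerate the candidate orders, then keep the one with the smallest worst-case execution time) can be sketched in Python. The helper names are hypothetical, and the worst-case times are simply the values quoted in the example rather than computed from the DFG.

```python
from itertools import permutations, product


def candidate_orders(conflicting_now, conflicting_next):
    """All orders formed by permuting each conflicting pair independently."""
    return [first + second
            for first, second in product(permutations(conflicting_now),
                                         permutations(conflicting_next))]


def best_order(worst_case_times):
    """Return the order whose worst-case execution time is smallest."""
    return min(worst_case_times, key=worst_case_times.get)


orders = candidate_orders((2, 3), (7, 9))

# Worst-case execution times as quoted in the example (not computed here).
quoted_times = dict(zip(orders, [34, 36, 38, 32]))
chosen = best_order(quoted_times)
print(orders)
print(chosen)
```

In a full implementation the worst-case time of each order would come from scheduling the dependent tasks; here that evaluation is replaced by the quoted figures.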

Schedule table entry for DFG[1] for our method and the PCP method 

[Table: Schedule Table for DFG[1]. Columns: start time of each process under our method and under PCP. Execution time: 35 (our method), 37 (PCP).] 

Similarly, schedule table entries can be generated for the remaining DFGs. 

[Table: Schedule Table for Remaining DFGs. Columns: start time of each process for each remaining path under our method and under PCP. Exec. T row: 35, 37, 35, 32, 32, 35, 39.] 

Branch and Bound scheduling 

Arranging the DFGs in decreasing order of their MPCP_delay (Exec. T in the tables), one 
obtains: 

DFG[0] -> DC      MPCP_delay[0] = 39 
DFG[1] -> DCK_    MPCP_delay[1] = 35 
DFG[2] -> DC K    MPCP_delay[2] = 35 
DFG[3] -> DC      MPCP_delay[3] = 35 
DFG[4] -> DCK     MPCP_delay[4] = 32 
DFG[5] -> DCK     MPCP_delay[5] = 32 

Now, one needs to determine the branch and bound schedule for DFG[0]. Branch and 
bound gives the optimal schedule. Here, the schedule produced by the modified PCP 
approach of the invention was already the optimal schedule. Hence, branch and bound 
produces the same schedule. Since the remaining delays are all less than the branch 
and bound delay produced, there is no need to do branch and bound scheduling for the 
remaining DFGs. 
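
The pruning argument above can be sketched in Python. This is an illustrative reading, not a definitive implementation: DFGs are visited in decreasing order of MPCP delay, and branch and bound is skipped for every DFG whose MPCP delay is already below the largest branch and bound delay found so far. The function name and the `branch_and_bound` callback are hypothetical; the delays are those listed above.

```python
def dfgs_needing_branch_and_bound(mpcp_delay, branch_and_bound):
    """Run branch and bound only where it can still affect the worst case.

    mpcp_delay maps a DFG name to its modified-PCP delay.
    branch_and_bound(name) is a caller-supplied function (hypothetical
    here) returning the optimal delay of that DFG.
    """
    order = sorted(mpcp_delay, key=mpcp_delay.get, reverse=True)
    refined = {}
    bound = None
    for name in order:
        if bound is not None and mpcp_delay[name] < bound:
            break  # this DFG and all later ones are already below the bound
        refined[name] = branch_and_bound(name)
        bound = refined[name] if bound is None else max(bound, refined[name])
    return refined


mpcp_delay = {"DFG[0]": 39, "DFG[1]": 35, "DFG[2]": 35,
              "DFG[3]": 35, "DFG[4]": 32, "DFG[5]": 32}

# In the example the modified PCP schedule was already optimal, so the
# stand-in below simply returns the MPCP delay.
refined = dfgs_needing_branch_and_bound(mpcp_delay, lambda n: mpcp_delay[n])
print(refined)
```

With the delays from the example, only DFG[0] is passed to branch and bound, which matches the conclusion drawn in the text.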
Schedule Merging: 

Schedule merging gives the optimal schedule for the entire CDFG. The optimal schedule must 
ensure that the common processes have the same schedule. If the common 
processes have different schedules, one modifies the schedule with the lesser delay. 

Schedule merging for (DCK, DC K) to give the optimal schedule for DC is done here. 
Processes common: 1,2,3,5,6,7,8,9,10,11,12,14,16,17 

From the schedule table, it can be observed that only 14 has a different schedule time. To 
make it equal, we push 14 down the schedule. The modified table is shown below. 
[Table: Modified Schedule Table for D C K and DC K. Columns: Process; start time under one path; start time under the other path before and after merging. Recoverable entries: process 14 starts at 22 under the first path and moves from 8 (before) to 22 (after) under the other path; the execution time is 35 in all three columns.] 
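
The merging step for two sibling paths can be illustrated with a simplified Python sketch. It assumes that a common process with differing start times is pushed down to the later of the two times, as was done for process 14 above, and it does not re-check data or resource dependencies, which a real merge must respect. The function name is hypothetical, and only a subset of the processes is shown.

```python
def merge_common(schedule_a, schedule_b):
    """Align the start times of processes common to two path schedules.

    Each schedule maps a process number to its start time. Common
    processes that differ are pushed to the later start time
    (simplifying assumption: this never violates a dependency).
    """
    merged_a, merged_b = dict(schedule_a), dict(schedule_b)
    moved = []
    for process in sorted(set(schedule_a) & set(schedule_b)):
        if schedule_a[process] != schedule_b[process]:
            later = max(schedule_a[process], schedule_b[process])
            merged_a[process] = merged_b[process] = later
            moved.append(process)
    return merged_a, merged_b, moved


# Subset of start times from the example; process 15 exists in one path only.
path_one = {1: 1, 14: 22, 16: 26, 17: 30}
path_two = {1: 1, 14: 8, 15: 23, 16: 26, 17: 30}
merged_one, merged_two, moved = merge_common(path_one, path_two)
print(moved, merged_two)
```

Processes that belong to only one path, such as 15 here, are left untouched because they do not appear in the shared parent column.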

Schedule merging for D C K and DC f to obtain optimal schedule for D C 
Processes common: 1,2,3,4,6,7,8,9,10,11,12,14,16,17 

Here, all the processes have the same schedule. Hence, there is no need to do schedule 
merging. 

Schedule merging for DC and DC to obtain optimal schedule for D 
Processes common: 1,2,3,6,7,8,9,10,11,12,14,16,17 
Here, 2,3,6,8,9,10,14,16 have different schedules. 
Hence, one needs to modify the schedules of D C K as it has a lesser delay. 
E.g., interchange the schedules of 2 and 3. 

Table: Modified Schedule Table for DC and D C 

Schedule merging for D C and DC to obtain optimal schedule for D 
Processes common: 1,2,3,6,7,8,9,10,11,13,14,17 
Here, 2,3,6,7,8,9,10,14 have different schedules. 
Hence, one needs to modify the schedules of D C as it has a lesser delay. 

Table: Modified Schedule Table for DC and D C 

Schedule merging for D and D* to obtain optimal schedule for 'true' condition 
Processes common: 1,2,3,6,7,8,9,10,11,14,17 
Here, 2,3,6,7,8,9,10,14,17 have different schedules. 
Hence, one needs to modify the schedules of D as it has a lesser delay. 

Table: Modified Schedule Table for D and D 

Here, the schedule for D also needed to be modified, without changing the total delay. 
In some cases, schedule merging can worsen the delay. 

Table: Final Schedule Table 

Reconfiguration 

Reconfiguration times have not been taken into account in the scheduling of CDFGs. 
An example shows how this time can influence the tightness of a schedule. Consider the 
following task graph (Fig. 23X). X, V and Z are processes performed by the same 
processing element. 

In the task graph, say 'a' is a variable that influences the decision on which of the two 
mutually exclusive paths (dash-dotted or dotted) will be taken, and 'a' is known during run 
time but much earlier than 'm' and 'z' have started. Let x, v, z and λ be the times taken by 
the processes in the event that 'a' happens to force the dash-dotted path to be taken. Let θ, δ, η 
be the reconfiguration times for swapping between the processes on the unit. Given these 
circumstances, if run time scheduling according to [68] is applied, it neglects the 
reconfiguration times and provides a schedule of five cycles as shown on the left hand side. 
But if reconfiguration time is considered, a schedule more like the one on the 
right hand side is tighter, with 4 clock cycles. This example shows the importance of 
considering reconfiguration time in a reconfigurable processor if fast swaps of tasks on the 
processing units need to be performed. 
Therefore, incorporating reconfiguration time into control flow graphs involves the 
following steps: 

i. Special edges are added onto the control flow graphs between a similar set of 
processes, which will be executed on the same processor with or without 
reconfiguration. In other words, these additional edges are inserted and the 
modified PCP scheduling as above is carried out with these in place. 

ii. Reconfiguration times affect the worst-case execution time of loopy codes. This 
has to be taken into account when loopy codes are being scheduled. 

iii. Care needs to be taken to schedule the transfer of the reconfiguration bit-stream from 
the main memory to the processor memory. 

Loop-based scheduling 

In static scheduling, loops whose iteration counts are not known at compile time 
impose scheduling problems on tasks which are data dependent on them, and on those tasks that 
have a resource dependency on their processing unit. Therefore, this preferred, exemplary 
embodiment takes into account the cases which are likely to impact the scheduling to the largest 
extent and provides solutions. 

Case 1: Solitary loops with unknown execution time. Here, the problem is that the execution 
time of the process is known only after it has finished executing in the processor. So static 
scheduling is not possible. 

Solution: (Assumption) Once a unit generates an output, this data is stored at the consuming 
/ target unit's input buffer. Referring to the scheduled chart of Fig. 24X, each row represents 
processes scheduled on a unique type of unit (Processing Element). Let P1 be the loopy 
process. 

From Fig. 24X we see that 

P3 depends on P1 and P4, 

P2 depends on P1, 

P6 depends on P2 and P5. 

If P1's lifetime exceeds the assumed lifetime (most probable lifetime or a unit 
iteration), then all dependents of P1 and their dependents (both resource and data) should be 
notified and the respective Network Schedule Manager (NSM) and Logic Schedule Manager 
(LSM), of Fig. 27X, should be delayed. Of course, this implies that while preparing the 
schedule tables, 2 assumptions are made. 

1) The lifetimes of solitary loops with unknown execution times are taken as per 
the most probable case obtained from prior trace file statistics (if available and 
applicable). Otherwise unitary iteration is considered. 

2) All processes that are dependent on such solitary loop processes are scheduled 
with a small buffer at their start times. This is to provide time for notification 
through communication channels about any deviation from assumption 1 at 
runtime. 
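
The set of processes that must be notified when the loopy process overruns can be found with a simple graph traversal. The sketch below is a minimal illustration using the dependencies read from Fig. 24X; the function name `affected_by_overrun` is hypothetical, and resource dependencies would be added to the same mapping in the same way as data dependencies.

```python
from collections import deque


def affected_by_overrun(depends_on, late_process):
    """Return every process that directly or transitively depends on
    late_process, i.e. every process whose schedule must be delayed."""
    # Invert "X depends on Y" into "Y is needed by X".
    dependents = {}
    for process, predecessors in depends_on.items():
        for pred in predecessors:
            dependents.setdefault(pred, set()).add(process)

    seen, queue = set(), deque([late_process])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Dependencies as listed for Fig. 24X.
depends_on = {"P3": {"P1", "P4"}, "P2": {"P1"}, "P6": {"P2", "P5"}}
to_notify = affected_by_overrun(depends_on, "P1")
print(sorted(to_notify))
```

P6 is reached only through P2, which shows why dependents of dependents have to be included in the notification.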

If assumption 1 goes wrong, the penalty paid is: 

Consider the example in Fig. 21X where two processes in the ready list are being 
scheduled based on PCP. Now by the PCP method, if λA > λB and P1 and P2 do not share the 
same resource, then PA is scheduled earlier than PB. It has been assumed that λA is due to 
the most probable execution time of Loop P1. But if at runtime Loop P1 executes a lesser 
number of times than predicted, resulting in λA being < λB, then the schedule of 
PA earlier than PB results in a mistake. 

The time difference between both possible schedules is calculated. It is not, at this point, 
proposed to repair the schedule because all processes before P1 have already been executed. 
And trying to fit another schedule at run time requires intelligence on the communication 
network, which is a burden. But on the brighter side, if at run time Loop P1 executes a greater 
number of times than predicted, then λA will still be > λB. Therefore the assumed schedule 
holds true. 

Case 2: A combination of two loops with one loop feeding data to the other in an iterative 
manner. 

Solution: Consider a processing element, PA, feeding data to a processing element, PB, in 
such a manner. For static scheduling, if one unrolls the loops and treats them as 
smaller individual processes, then it is not possible to assume an unpredictable number of 
iterations. Therefore if an unpredictable number of iterations is assumed in both loops, then 
the memory foot-print could become a serious issue. But an exception can be made. If both 
loops at all times run for the same number of iterations, then the schedule table must initially 
assume either the most probable number of iterations or one iteration each and schedule 
PA, PB, PA, PB and so on in a particular column. In case the prediction is exceeded or fallen 
short of, the NSM and LSMs must do 2 tasks: 

1) If the iterations exceed expectations, then all further dependent processes 
(data and resource) must be notified of the postponement and notified for 
scheduling upon completion of the iterations, with the appropriate difference between 
the expected schedule times and those obtained at run time. If the iterations fall short of 
expectations, then all further schedules need only be preponed (moved up). 

2) Since the processes PA and PB should denote a single iteration in the table, 
their entries should be continuously incremented at run time by the NSM and 
the LSMs. The increment for one process of course happens for a 
predetermined number of times, triggered by the schedule or execution of 
the other process. For example, in Fig. 25X, we see that PA = 10 cycles, PB = 
20 cycles and hence if both loops run five times, then the entry in the 
column increments as shown. 

Only in such a situation can there be preparedness for unpredictable loop iteration 
counts. 
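
The run-time incrementing of the PA and PB entries can be sketched as follows. This is only an illustration under an assumption that is not confirmed by the text: the two processes alternate strictly, each PB iteration starting when the preceding PA iteration ends. The function name is hypothetical; the cycle counts are those quoted above.

```python
def alternating_entries(pa_cycles, pb_cycles, iterations, start=0):
    """List (process, iteration, start_time) for PA, PB, PA, PB, ...

    Assumes strict alternation with no overlap between PA and PB.
    Returns the entries and the time at which the last PB finishes.
    """
    entries = []
    t = start
    for i in range(iterations):
        entries.append(("PA", i, t))
        t += pa_cycles
        entries.append(("PB", i, t))
        t += pb_cycles
    return entries, t


entries, finish = alternating_entries(pa_cycles=10, pb_cycles=20, iterations=5)
for process, iteration, start_time in entries:
    print(process, iteration, start_time)
print("finish:", finish)
```

If the loops run longer than predicted, calling the function with a larger `iterations` value extends the same column, and the difference in `finish` is the postponement that dependent processes must be told about.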

Case 3: A loop at the macro level, i.e., containing more than a single process. 

Solution: In this case, there are some control nodes inside a loop. Hence the execution time 
of the loop changes with each iteration. This is a much more complicated case than the 
previous ones. Here, consider a situation where there is a loop covering two mutually 
exclusive paths, each path consisting of two processes (A,B and C,D) with (3,7 and 15,5) 
cycle times. In the schedule table there will be a column to indicate an entry into the loop and 
two columns to indicate the paths inside the loop. Optimality in scheduling inside the loop 
can be achieved, but in the global scheme of scheduling, the solution is non-optimal. This 
cannot be helped because to obtain a globally optimal solution, all possible paths have to be 
unrolled and statically scheduled. This results in a table explosion and is not feasible, since 
a table cannot hold an unbounded number of entries. Hence, from a global 
viewpoint the loop and all its entries are considered as one entity, with the most probable 
number of iterations considered, and the most expensive path in each iteration is assumed to 
be taken. For example, in the above case, path C,D is assumed to be taken all the time. 

Now, a schedule is prepared for each path and entered into the table under two 
columns. When one schedule is being implemented, the entries for both columns in the next 
loop iteration are predicted by adding the completion time of the current path to both column 
entries (of course, while doing this, care should be taken not to overwrite the entries of the 
current path while they are still being used). Then when the current iteration is completed and 
a fresh one is started, the path is realized and the appropriate (updated / predicted) table 
column is chosen to be loaded from the NSM to the LSMs. 
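
Both parts of Case 3 can be illustrated with a short Python sketch: the global estimate that assumes the most expensive path in every iteration, and the prediction of the next iteration's column entries. The function names and the iteration count of 4 are hypothetical; the cycle times are those of paths A,B and C,D above, and the processes of a path are assumed to run back to back.

```python
def worst_case_loop_time(paths, probable_iterations):
    """Global estimate: the costliest path is assumed in every iteration."""
    worst = max(paths, key=lambda name: sum(paths[name]))
    return worst, probable_iterations * sum(paths[worst])


def predict_next_iteration(columns, taken_path, paths):
    """Shift both columns by the completion time of the path just taken.

    A new dict is returned so the entries of the current iteration are
    not overwritten while they are still in use.
    """
    shift = sum(paths[taken_path])
    return {name: [t + shift for t in starts]
            for name, starts in columns.items()}


paths = {"A,B": [3, 7], "C,D": [15, 5]}          # cycle times per process
columns = {"A,B": [0, 3], "C,D": [0, 15]}        # start times, first iteration

worst, estimate = worst_case_loop_time(paths, probable_iterations=4)
next_columns = predict_next_iteration(columns, "A,B", paths)
print(worst, estimate)
print(next_columns)
```

When the cheaper path A,B is actually taken, the next iteration can start after 10 cycles instead of the 20 assumed globally, which is the slack that the updated column exposes.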
Network architecture 

In order to coordinate the mapping of portions of the schedule table onto 
corresponding CLUs, we propose the following architecture. In Fig. 26X, the interfacing of 
the Reconfigurable unit with the host processor and other I/O and memory modules is shown. 

The Network Schedule Manager (Fig. 27X) has access to a set of tables, one for each 
processor. A table consists of possible tentative schedules for processes or tasks that must be 
mapped onto the corresponding processor subject to evaluation of certain conditional control 
variables. The Logic Schedule Manager schedules and loads the configurations for the 
processes that need to be scheduled on the corresponding processor, i.e., all processes that 
come in the same column (a particular condition) in the schedule table. In PCP scheduling, 
since the scheduling of the processes in the ready list depends only on the part of the paths 
following those processes, the execution time of the processes shall initially conveniently 
include the configuration time. 

Once a particular process is scheduled and hence removed from the ready list, another 
process is chosen to be scheduled based on the PCP criteria again. But this time the execution 
time of that process is changed or rather reduced by using the reconfiguration time, instead of 
the configuration time. Essentially, for the first process that is scheduled in a column, 

the completion time = execution time + configuration time. 

For the next or successive processes, 
completion time = predecessor's completion time + execution time + reconfiguration time. 
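
The two completion-time rules above can be applied to a whole column with a few lines of Python. This is a minimal sketch; the function name and the numeric values in the example are hypothetical.

```python
def completion_times(exec_times, config_time, reconfig_time):
    """Completion time of each process scheduled in one column.

    First process:  execution time + configuration time.
    Later process:  predecessor's completion time + execution time
                    + reconfiguration time.
    """
    times = []
    previous = 0
    for index, exec_time in enumerate(exec_times):
        overhead = config_time if index == 0 else reconfig_time
        previous = previous + exec_time + overhead
        times.append(previous)
    return times


# Hypothetical values, in clock cycles.
column = completion_times([5, 3, 4], config_time=6, reconfig_time=2)
print(column)
```

Only the first entry pays the full configuration time; every later entry pays the (usually smaller) reconfiguration time instead.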

Assuming that once a configuration has been loaded into the CM, the process of 
putting in place the configuration is instantaneous, it is always advantageous to load 
successive configurations into the CM ahead of time. This will mean a useful latency hiding 
for loading a successive configuration. 

The reconfiguration time is dependent on two factors: 

1) How much configuration data needs to be loaded into the CM (Application 
dependent) 

2) How many wires are there to carry this info from the LSM to the CM (Architecture 
dependent) 

The Network Schedule Manager should accept control parameters from all LSMs. It 
should have a set of address decoders, because to send the configuration bits to the network 
fabric consisting of a variety of switch boxes, it needs to identify their location. Therefore for 
every column in the table, the NSM needs to know the route a priori. One must not try to find 
a shortest path at run time. For a given set of processors communicating, there should be a 
fixed route. If this is not done, then the communication times of the edges in the CDFG cannot 
be used as constants while scheduling the graph. 

For any edge, 

communication time = a constant and uniform configuration time + data transaction time. 

The Network architecture consists of switch boxes and interconnection wires. The 
architecture will be based on the architecture described in [1]. This will be modeled as a 
combination of "Behavioral" and "Structural" style VHDL. Modifications that will be made 
are: 

a. The Processing Elements derived in section 3 will be used instead of the four 
input LUTs that were used in Andre's model. 

b. RAM style address access will be used to select a module or a switch box on the 
circuit. 

c. Switch connections that are determined to be fixed for an application will be 
configured only once (at the start of that application). 

d. Switch connections that are determined to be fixed for all applications will be 
shorted and the RC model for power consumption for that particular connection 
will be ignored for power consumption calculations. 

e. The number of hierarchy levels will be determined by the application that has the 
maximum number of modules, because there is a fixed number of modules that 
can be connected 

There will be one Network Schedule Manager (NSM) modeled in "Behavioral" and 
"Structural" style VHDL. It will store the static schedule table for the currently running 
application. The NSM collects the evaluated Boolean values of all conditional variables from 
every module. 

For placing modules on the network two simple criteria are used. These are based on 
the assumption that the network consists of Groups of four Processing Unit Slots (G4PUS) 
connected in a hierarchical manner. 

Note: A loop could include zero or more CGPEs. 

Therefore the following priority will be used for mapping modules onto the G4PUS: 

a. A collection of one to four modules which are encompassed inside a loop shall be 
mapped to a G4PUS. 

i. If there are more than four modules inside a loop, then the next batch of four 
modules is mapped to the next (neighboring) G4PUS. 

ii. If the number of CGPEs in a loop is greater than 2, then they will have greater priority over 
any FGPEs in that loop for a slot in the G4PUS. 

b. For all other modules: 

iii. CGPE modules with more than one fan-in from other CGPEs will be 
mapped into a G4PUS. 

iv. CGPE modules with more than one fan-in from other FGPEs will be mapped 
into a G4PUS. 

Note: The priorities are based on the amount of communication 
between modules. Both fan-ins and fan-outs could be considered; for simplicity, only fan-ins to 
CGPEs are considered here. 
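
Rule a, with its sub-rules i and ii, can be sketched in Python as a batching step. This is an illustrative reading only: modules of one loop are filled into groups of four slots, and CGPEs are moved ahead of FGPEs when the loop contains more than two CGPEs. The function name and module names are hypothetical.

```python
def map_loop_to_g4pus(modules, slots_per_group=4):
    """Assign the modules of one loop to successive G4PUS groups.

    modules is a list of (name, kind) pairs with kind "CGPE" or "FGPE".
    Returns a list of groups, each a list of at most four module names.
    """
    cgpe_count = sum(1 for _, kind in modules if kind == "CGPE")
    if cgpe_count > 2:
        # Stable sort: CGPEs first, original order otherwise preserved.
        ordered = sorted(modules, key=lambda m: m[1] != "CGPE")
    else:
        ordered = list(modules)
    return [[name for name, _ in ordered[i:i + slots_per_group]]
            for i in range(0, len(ordered), slots_per_group)]


loop_modules = [("f1", "FGPE"), ("c1", "CGPE"), ("f2", "FGPE"),
                ("c2", "CGPE"), ("c3", "CGPE"), ("f3", "FGPE")]
groups = map_loop_to_g4pus(loop_modules)
print(groups)
```

The second group in the result stands for the neighboring G4PUS of sub-rule i; choosing which physical neighbor to use is outside the scope of this sketch.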
Testing Methodology 

In this research effort, one focuses mainly on reducing the number of reconfigurations 
that need to be made for running an application and then running other applications on the 
same processor. One also aims to reduce the time required to load these configurations from 
memory in terms of the number of configuration bits corresponding to the number of 
switches. 

Time to execute an application for a given area (area estimate models of XILINX 
FPGAs and Hierarchical architectures can be used for only the routing portion of the circuit) 
and a given clock frequency can be measured by simulation in VHDL. 

The time taken to swap clusters within an application and swap applications 
(reconfigure the circuit from implementing one application to another) is dependent on the 
similarity between the successor and predecessor circuits. The time to make a swap will be 
measured in terms of the number of bits required for loading a new configuration. RAM 
style loading of configuration bits will be used, which has been proven [2] to be faster than serial loading 
(used in Xilinx FPGAs). Speed above the RAM style is expected for two reasons: 

a) The address decoder can only access one switch box at a time. So the greater the 
granularity of the modules, the fewer the number of switches used and hence configured. 

b) Compared to peer architectures which have only LUTs or a mixture of LUTs and 
CGPEs with low granularity (MAC units), CGPEs are expected to be of moderate granularity 
for abstract control-data flow structures in addition to FGPEs. Since these CGPEs are derived 
from the target applications, their granularity is expected to be the best possible choice for a 
reconfigurable purpose. They are modeled in "Behavioral" VHDL and are 
targeted to be implemented as ASICs. This inherently would lead to a reduced number of 
configurations. 

The time taken to execute each application individually will be compared to available 
estimates obtained for matching area and clock specifications from work carried out by other 
researchers. This will be in terms of number of configurations per application, number of bits 
per configuration, number of configurations for a given set of applications and hence time in 
seconds for loading a set of configurations. 
Regarding power consumption, the sources of power consumption for a given application 
can be classified into four parts: 

a. Network power consumption due to configurations within an application. This is 
due to the effective load capacitance on a wire for a given data transfer from one module to 
another for a particular configuration of switches. 

Note: The more closed switches a signal has to pass through, the greater the 
effective load capacitance and resistance. Shorted switches are not considered to contribute to 
this power. 

b. Data transfer into and out of the processor. 

Note: This can have a significant impact on the total power in media rich or 
communication dominated applications ported onto any processing platform. 

c. Processing of data inside a module. 

Note: This will require synthesizable VHDL modules. But since the focus 
here is on reducing power due to reconfiguration, this is presently left for future work. 

d. The clock distribution of the processor. 

Note: This can be measured if all parts of the circuit are synthesizable. But 
the focus here is on a modeling aspect and this measurement is not presently considered. 

At the level of modeling a circuit in VHDL, it is possible to determine the power 
consumption only approximately. One can use the RC models of XILINX FPGAs and [1] 
architectures to get approximate power estimates. Power aware scheduling and routing 
architecture design are complex areas of research in themselves and are not the focus here. 
Here the focus is on reducing the number of reconfigurations, which directly impacts the 
speed of the processor and indirectly impacts the power consumption to a certain extent. 
Overall Architecture 

Tool Set: Profiling, Partitioning, Placement and Routing 

One aspect of the present invention aids the design of the circuitry or architecture of a 
dynamically reconfigurable processor through the use of a set of analysis and design tools. 
These will help hardware and system designers arrive at optimal hardware-software 
co-designs for applications of a given class, namely moderately complex programmed applications such 
as multimedia applications. The reconfigurable computing devices thus designed are able to 
adapt the underlying hardware dynamically in response to changes in the input data or 
processing environment. The methodology for designing a reconfigurable media processor 
involves hardware-software co-design based on a set of three analysis and design 
tools [AK02]. The first tool handles cluster recognition, extraction and a probabilistic model 
for ranking the clusters. The second tool provides placement rules and a feasible routing 
architecture. The third tool provides rules for data path, control units and memory design 
based on the clusters and their interaction. With the use of all three tools, it becomes possible 
to design media (or other) processors that can dynamically adapt at both the hardware and 
software levels in embedded applications. The input to the first tool is a compiled version of 
the application source code. Regions of the data flow graph obtained from the source code 
which are devoid of branch conditions are identified as zones. Clusters are identified in the 
zones by representing candidate instructions as data points in a multidimensional vector 
space. Properties of an instruction, such as location in a sequence, number of memory 
accesses, floating or fixed-point computation, etc., constitute the various dimensions. As 
shown in Ali Fig. 1, clusters obtained from the previous tool are placed and routed by Tool 
number 2, according to spatial and temporal constraints (Ali Fig. 2). The processor (of the 
compiler) can be any general purpose embedded computing core such as an ARM core or a 
MIPS processor. These are RISC cores and hence are similar to general purpose machines 
such as UltraSPARC. The output of the tool is a library of clusters and their interaction. (A 
cluster comprises sequential but not necessarily contiguous assembly level instructions.) 
The clusters represent those groups or patterns of instructions that occur frequently and hence 
qualify for hardware implementation. To maximize the use of reconfigurability amongst 
clusters, possible parallelism and speculative execution possibilities must be exploited. 

Referring to Ali Fig. 1, the methodology for designing a reconfigurable media processor
involves hardware-software co-design based on a set of three analysis and design tools
[83, 84]. The first tool is the profiling and partitioning step, which handles cluster
recognition and extraction and provides a probabilistic model for ranking the clusters. The
second tool provides placement rules and a feasible routing architecture. The third tool
provides rules for task scheduling, data path, control units and memory design based on the
clusters and their interaction. Tool three generates all possible execution paths and a
corresponding scheduling table for each. It then maps the tasks into the reconfigurable
area. As a modification, instead of using a compiled version of the MPEG4 decoder source
code, the proposed approach generates intermediate three-address code from the high level C
code. Machine independence and control flow information are preserved with this approach.
The partitioning tool analyzes the intermediate code and extracts the control-data flow
graph (CDFG). Each bulk of purely data dependent code in between the control structures is
defined as a zone. The partitioning tool then runs a longest common subsequence type of
algorithm to find the recurring patterns between zones that are candidates to run on
hardware. Building blocks represent those groups or patterns of instructions that occur
frequently and hence qualify for hardware implementation. A pattern means a building block
that consists of a control flow structure. A pattern may also include a group of building
blocks that are only data dependent. A control structure may be a combination of if-else and
loop statements with nested cases. The output of the partitioning tool is a library of
building blocks and their interaction. Interaction information includes how many times two
building blocks exchange data and the size of the data exchanged. The tool also provides the
number of clock cycles required to execute each building block. In addition, input/output
pins and area information for each building block are provided. With this information an
interconnection pattern can be determined prior to execution. This helps to exploit locality
and thereby simplify the interconnection structure and reduce the usage of global buses,
fan-ins and fan-outs. The placement tool places building blocks that exchange data more
frequently close together. Clusters obtained from Tool 1 are placed and routed by Tool 2
according to spatial and temporal constraints, as diagrammatically illustrated in Ali Fig. 2.
To maximize the use of reconfigurability amongst clusters, possible parallelism and
speculative execution possibilities are exploited.
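
The recurring-pattern search can be illustrated with a textbook longest common subsequence
routine. This is a minimal sketch only: the text says a "longest common subsequence type of
algorithm" is used but does not give it, so the function name, the opcode lists and the
choice to compare opcodes alone (ignoring operands) are assumptions made for illustration.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of two opcode sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table to recover the subsequence itself.
    result = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


# Hypothetical opcode sequences of two zones (not taken from the MPEG4 decoder).
zone_a = ["ld", "add", "st", "ld", "mul", "st"]
zone_b = ["ld", "sub", "add", "st", "mul", "st"]
pattern = longest_common_subsequence(zone_a, zone_b)
print(pattern)
```

The recovered pattern is sequential but not contiguous in either zone, which matches the way
clusters are characterized in the text.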
Heterogeneous Hierarchical Architecture 

Aggarwal [85] states that hierarchical FPGAs (H-FPGAs) can implement circuits with fewer
routing switches in total than symmetrical FPGAs. According to Li [86], the amount of
routing resources required for H-FPGAs is greatly reduced while good routability is
maintained. It has been proved that the total number of switches in an H-FPGA is less than
in a conventional FPGA under equivalent routability [87]. Having fewer switches to route a
net in H-FPGAs reduces the total capacitance of the network. An H-FPGA can therefore
implement much faster logic with far fewer routing resources than a standard FPGA. H-FPGAs
also offer the advantage of more predictable routing with lower delays. Hence the density of
H-FPGAs can be higher than that of conventional FPGAs. In the case of the present invention,
in contrast to hierarchical and symmetrical FPGA approaches, building blocks are of variable
size. Classical horizontal and vertical channels will not result in an area efficient
solution. Consistent channel capacity at each hierarchy level will not work because of the
variable traffic between the building blocks even at the same hierarchy level. Due to
variable traffic among clusters and non-symmetric characteristics, different types of
switches are needed at each hierarchy level. All these factors result in heterogeneity
between groups of building blocks at the same hierarchy level, as opposed to the classical
H-FPGA approach. Therefore a heterogeneous hierarchical routing architecture that makes use
of the communication characteristics is essential to implement a power and time efficient
solution.
Proposed Architecture 

The network scheduler, building blocks, switches and wires form the reconfigurable unit of
the present invention. A profiling and partitioning tool lists building blocks as
B = {B1, B2, ..., Bk}, where Bi ∈ B. Based on the data dependency between the building
blocks, disjoint subsets of B are grouped together to form clusters. A building block should
appear in only one cluster.

In Ali Fig. 4(a), at time t = tj, B1 receives (a,b) and (c,d) from memory. If multiple
copies of B1 are available, then both can run at the same time without a resource conflict.
However, that would work against the definition of a reconfigurable solution. In the second
scenario (Ali Fig. 4(b)), B1 processes the data of the most critical path first (B3 B2 or
B5 B4) while the second path waits. For such resource or scheduling conflicts a network
scheduler module is introduced, which is a controller unit over the reconfigurable area.
Handling dynamic reconfiguration and context switching are the major tasks of this unit. The
most critical path is initially loaded into the network scheduler. At run time, if a path
that is not the critical path needs to be executed, it is the network scheduler's job to
perform the context switch and load the schedule for the new path. The network scheduler
also offers a control mechanism over data transmission between building blocks. Buffering is
needed when the receiver needs to process bulks of data at a time. For a given context, if
the consumer demands data in block form, then the receiver should rearrange the incoming
data format. Both sender and receiver should be context aware. Buffers are kept only at the
receiver side. A producer simply dumps the data onto the bus as soon as it is available. The
receiver should be aware of the context of each request and make a decision based on
priority in order to prevent collisions. If the receiver needs to get data from more than
one sender, then those senders that are in the ok list are allowed to transmit data, whereas
other requests are denied. This is again handled by the collision prevention mechanism. The
connection service mechanism brings a control overhead cost; however, it provides controlled
router service, efficient resource usage and parallelism.

As shown in Ali Fig. 5, clusters of building blocks form level-1 (M) modules. Similarly,
clusters of M modules form level-2 (C) modules. Two types of switches are defined: local
switches (LS) and gateway switches (GS). Local switches function within level-1 and level-2
modules. Gateway switches allow moving from one hierarchy level to another. Depending on the
place of the LS or GS, multiple LSs may be needed for LS to LS connections. Connections
between the building blocks of the same level-2 module are handled through local switches
only. For all other connections, gateway switches distribute the traffic as shown in Ali
Fig. 6. A building block uses the local global bus to connect to the gateway switch of the
module to which that building block belongs. Bus capacity and gateway switch complexity
increase as the hierarchy level increases, and switches vary in flexibility even at the same
hierarchy level.

Level-1 blocks use the local global bus to connect to the gateway switch of the cluster to
which the building block belongs. If a block in module 2 of cluster 1 sends data to a block
in module 1 of cluster 2, the data goes through the global buses only, following the path
Source Block, GS in C1, GS in Level 3, GS in C2 and finally reaching the Destination Block
(Ali Fig. 6). Dashed lines represent the local connections through local switches.
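
The routing rule just described can be sketched as a small lookup. This is an illustrative
sketch under stated assumptions, not the routing tool of the invention: block addresses are
written as hypothetical (cluster, module, name) tuples, and all local switches on a path
within one level-2 module are collapsed into a single "LS" hop, although the text notes that
several LSs may be needed in practice.

```python
def route(source, destination):
    """Return the list of hops between two blocks.

    Each block is a hypothetical (cluster, module, name) tuple. Blocks in the
    same level-2 module (cluster) are connected through local switches only;
    all other traffic is carried by gateway switches via Level 3.
    """
    src_cluster, dst_cluster = source[0], destination[0]
    if src_cluster == dst_cluster:
        return ["Source Block", f"LS in C{src_cluster}", "Destination Block"]
    return [
        "Source Block",
        f"GS in C{src_cluster}",
        "GS in Level 3",
        f"GS in C{dst_cluster}",
        "Destination Block",
    ]


# Block in module 2 of cluster 1 sending to a block in module 1 of cluster 2.
print(route((1, 2, "Bx"), (2, 1, "By")))
# Two blocks inside cluster 1 stay on local switches.
print(route((1, 1, "Bx"), (1, 2, "By")))
```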
Methodology 

As indicated in Ali Fig. 7, the methodology in accordance with this invention involves the
implementation of packing, hierarchy formation, placement, network scheduling and routing
tools. New cost function metrics are generated for the routability driven packing algorithm.
The cost function takes into account each possible execution path of the application
obtained from a given CDFG, the library of variable size building blocks, and building block
timing and dependency analysis. The cost function reduces the complexity of the placement
and routing steps, since the constraints of these steps are evaluated as early as the
packing step.
Packing 

Several time or area driven packing algorithms with bottom-up or top-down approaches have
been proposed. As shown in Ali Fig. 7, the present methodology is a bottom-up approach. In
Lookup Table (LUT) based or building block based reconfigurable solutions, increasing the
complexity of the processing element increases functionality and hence decreases the total
number of logic blocks used by the application and the number of logic blocks on the
critical path. For a fine-grained approach, more logic blocks will be required to implement
the circuit, and the routing area may then become excessive. In coarse-grained logic, much
of the logic functionality may be unused, wasting area. There is a tradeoff between the
complexity of logic blocks and area efficiency. A cost function is needed to decide whether
to insert a building block into one of the candidate clusters. Reference [93] uses a
sequential packing algorithm with a cost function that depends on the number of intersecting
nets between a candidate cluster and a building block. As a modification to this approach,
[94] uses time driven packing with the objective of minimizing the connections between the
clusters on the critical path. Building blocks are packed sequentially along the critical
path. References [95] and [96] are routability driven packing approaches that incorporate
routability metrics such as the density of high fan-out nets, the traffic in and out of the
logic block, the number of nets and connectivity into the packing cost function. All of
these approaches are based on a fixed K-input LUT and N LUTs in a cluster. In addition to
having variable size building blocks, the present approach takes into account the control
data flow graph of each possible execution path to be handled by the reconfigurable unit.

For an if-else statement, one does not know at compile time whether the if part or the else
part of the statement will be executed. Similarly, one may not know how many times a loop
will execute. Packing of building blocks should therefore favor all possible execution
paths. Given that the configuration is based on the if part of a control statement, when the
else part of the path is to be executed the network scheduler should perform the least
amount of reconfiguration. Ali Fig. 8(a) shows a simple if-else statement with building
blocks inside the control structure. As shown in Ali Fig. 8(b), since the two paths cannot
execute at the same time, the clustering tool groups the building blocks that are within the
same statement (if or else) as shown in Ali Fig. 7. If a building block that appears in the
else part happens to occur on the path of Path_1, then the network scheduler handles the
connection between the two clusters through global switches. Since the architecture needs to
reconfigure at run time, the present approach prioritizes time over the area constraint.
Possible waste of area during clustering because of irregular building block shapes or
irregular cluster shapes at higher hierarchy levels is ignored as long as the time
constraint is satisfied. In addition to the metrics defined in [91, 92], the present
invention incorporates the scheduling information into its cost function. The cost of adding
a building block to a cluster depends on how the timing of the circuit is affected on the
different possible execution paths. At the packing step the tasks of placement and routing
are simplified. The packing tool uses a set of building blocks, a CDFG for each possible
execution scenario, the input and output pins of each building block, the number of cycles
required by each building block, and the scheduling information for all possible execution
scenarios. The inventors have encountered no work on packing variable size building blocks
into variable size clusters using CDFG, execution path and scheduling analysis information.

The packing tool groups the building blocks into level-1 type clusters. Those clusters are
then grouped together to form level-two and higher levels. At each hierarchy level, existing
clusters and their interaction information are used to form higher-level clusters one step
at a time. As seen in the example, in the hierarchy formation step (Ali Fig. 7), the process
continues recursively until level-three is reached.

Placement 

For a level-one cluster, let n be the number of building blocks, Cij be the number of
occurrences of a direct link between building blocks Bi and Bj, and Dij be the amount of
data traffic, in number of bits, transferred between the blocks Bi and Bj through direct
links, where 1 ≤ i ≤ n, 1 ≤ j ≤ n. The cost of data exchange between the two library modules
Bi and Bj is then defined as:



block should be placed to the north, south, east or west of another block. This is
established by using the dependency information. The placement algorithm then uses a
modified simulated annealing method that incorporates the orientation information obtained
in this step, which helps in making intelligent placement decisions. The objective of
pre-placement is to place the pairs of building blocks that have the most costly data
exchange closest to each other. As the cost of the link decreases, the algorithm tolerates a
Manhattan distance of more than one hop between the pairs of building blocks. This phase
guarantees an improvement in area allocation because building blocks are placed based on
their dependency, leading to the use of fewer switches or shorter wires to establish a
connection between them. An integer programming technique is used to decide the orientation
of the building blocks with respect to each other. Given n building blocks, in the
worst-case scenario, if the blocks are placed diagonally on a grid (assuming that each block
is of unit size), the placement is done on an n x n matrix. Let Pi(x,y) denote the (x,y)
coordinates of building block Bi, where no other building block has the same (x,y)
coordinates. The objective function is:



Ali Fig. 9(a) shows the cost matrix of six given blocks (A, B, C, D, E, F). Those six nodes
are treated as points to be placed on a 6x6 matrix. The output of pre-placement is shown in
Ali Fig. 9(b).

Since scheduling, CDFG and timing constraints have already been incorporated in the packing
algorithm, the placement problem is made simpler. After virtual placement is completed for
each level-one cluster, the same process continues recursively for level-two and higher
levels of clusters.
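
The goal of pre-placement, putting the most costly pairs closest together, can be
illustrated with a toy search. This is a sketch under loudly stated assumptions: the
objective used here (sum over pairs of link cost times Manhattan distance) and the
exhaustive search are stand-ins chosen for illustration, not the integer programming
formulation or the modified simulated annealing referred to in the text, and the cost values
are hypothetical.

```python
from itertools import permutations


def manhattan(p, q):
    """Manhattan distance between two grid cells."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def preplace(cost, size):
    """Exhaustively place blocks on a size x size grid.

    `cost` maps a pair of block names to an assumed link cost. The assumed
    objective is the sum of cost * Manhattan distance over all pairs; only
    practical for a handful of blocks.
    """
    blocks = sorted({name for pair in cost for name in pair})
    cells = [(x, y) for x in range(size) for y in range(size)]
    best_positions, best_total = None, None
    for chosen in permutations(cells, len(blocks)):
        positions = dict(zip(blocks, chosen))  # one distinct cell per block
        total = sum(
            weight * manhattan(positions[a], positions[b])
            for (a, b), weight in cost.items()
        )
        if best_total is None or total < best_total:
            best_positions, best_total = positions, total
    return best_positions, best_total


# Hypothetical link costs between three blocks.
cost = {("A", "B"): 10, ("B", "C"): 3, ("A", "C"): 1}
placement, total = preplace(cost, 3)
print(placement, total)
```

In this toy case the two heaviest links end up one hop long, and the lightest link tolerates
a distance of two hops, which is the behavior the pre-placement phase aims for.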

Implementation Results:

Target Device  : x2s200e
Mapper Version : spartan2e -- $Revision: 1.16 $

Resource                                      Bits
1) Configuration file size                    1,442,016
2) Block RAM bits                             57,344
3) Bits used for logic                        1,384,672 (1 - 2)
Bits/slice                                    ~588

Resource                                      Gates/slice
Configuration storage
  588 bits/slice * 4 gates/bit                2352
Behavior
  588 bits/slice * 1 gate/bit                 588
Total gates/slice                             2940



The common part of the Affine-Perspective loop / pre-loop:

Total number of slices used = 893 / 1590 slices

Number of bits = 893 / 1590 slices x 588 bits/slice
               = 525,084 / 934,920 bits of configuration

Number of gates = 2940 gates/slice x 893 / 1590 slices
                = 2,625,420 / 4,674,600

Number of equivalent gates (ASIC) as given by Xilinx map report = 23,760 / 32,548

(Actual gate counts are accepted to be exaggerated by a factor of 5 by Xilinx.)
Therefore a better estimate of the equivalent gate count = 4752 / 6509

Configuration:

Configuration speed for Xilinx Spartan 2E chip = 400 Mb per sec (approx.)

Time to configure pre-loop bits = 2.337 ms (934,920 divided by 400 Mb per sec) (A)

Time to configure loop bits = 1.312 ms (525,084 divided by 400 Mb per sec) (A)

Max. clock frequency for loop / pre-loop ≈ 58.727 / 52.059 MHz
Clock period = 17.028 / 19.2089 ns (B)

Therefore number of clocks saved in using ASIC for the loop = A divided by B

= 77,000 clock cycles (approx.)

Therefore number of clocks saved in using ASIC for the pre-loop = A divided by B

= 122,000 clock cycles (approx.)
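
The arithmetic behind these estimates can be rechecked from the stated inputs. The short
script below is only a convenience sketch: the constant and function names are hypothetical,
and it assumes that the configuration rate of 400 Mb per second means 400 x 10^6 bits per
second.

```python
BITS_PER_SLICE = 588       # configuration bits per slice, from the first table
GATES_PER_SLICE = 2940     # 588 * 4 + 588 * 1, from the second table
CONFIG_RATE = 400e6        # assumed: 400 Mb per sec = 400e6 bits per second


def configuration_cost(slices, clock_mhz):
    """Derive bits, gates, configuration time and equivalent clock cycles."""
    bits = slices * BITS_PER_SLICE
    gates = slices * GATES_PER_SLICE
    config_ms = bits / CONFIG_RATE * 1e3     # time to load the bits, in ms
    period_ns = 1e3 / clock_mhz              # clock period in ns
    cycles = config_ms * 1e6 / period_ns     # configuration time in clock cycles
    return bits, gates, config_ms, period_ns, cycles


loop = configuration_cost(893, 58.727)
pre_loop = configuration_cost(1590, 52.059)
for name, values in (("loop", loop), ("pre-loop", pre_loop)):
    bits, gates, config_ms, period_ns, cycles = values
    print(f"{name}: {bits} bits, {gates} gates, {config_ms:.3f} ms, "
          f"{period_ns:.3f} ns, {cycles:.0f} cycles")
```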

Although preferred embodiments of the invention have been described in detail, it 
will be readily appreciated by those skilled in the art that further modifications, alterations 
and additions to the invention embodiments disclosed may be made without departure from 
the spirit and scope of the invention as set forth in the appended claims. 
Appendices:
Appendix A

A Control Data Flow Graph consists of both data flow and control flow portions. In
compiler terminology, all regions in a code that lie in between branch points are referred
to as Basic Blocks. Those basic blocks that have additional code due to code movement
shall be referred to as zones. Also, under certain conditions, decision making control
points can be integrated into the basic block regions. These blocks should be explored for
any type of data level parallelism they have to offer. Therefore, for simplicity in the
following description, basic blocks are referred to as zones. The methodology remains the
same when modified basic blocks and abstract structures such as nested loops and hammock
structures are considered as zones.

High level ANSI C code of the target application is first converted to assembly code
(UltraSPARC). Since the programming style is user dependent, the assembly code needs to be
expanded in terms of all function calls. To handle the expanded code, a suitable data
structure that has a low memory footprint is utilized. Assembly instructions that act as
delimiters to zones must then be identified. The data structure is then modified to lend
itself to a more convenient form for extracting zone level parallelism.

The following are the steps involved in extracting zone level parallelism.
Step-1: Parsing the assembly files

In this step a doubly linked list is created for each assembly (.s) file, where each node
stores one instruction with its operands and has pointers to the previous and next
instructions in the assembly code. The parser ignores all commented out lines and all
lines without instructions, except labels such as
Main:
.LL3:

Each label starting with .LL is replaced with a unique number (unique over all
functions).
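
A minimal sketch of this parsing step is given below. It is an illustration under stated
assumptions rather than the parser of the invention: the names Node, parse_assembly and
texts are hypothetical, "!" is taken as the comment character, any line that starts with "."
and is not a label is treated as a directive without an instruction, and references to .LL
labels in branch operands are renumbered together with the labels themselves.

```python
import itertools
import re

LOCAL_LABEL = re.compile(r"\.LL\d+")


class Node:
    """One instruction or label, linked to its neighbours."""

    def __init__(self, text):
        self.text = text
        self.prev = None
        self.next = None


def parse_assembly(source, counter):
    """Build a doubly linked list from one .s file.

    `counter` is shared between files so that renumbered labels stay unique
    over all functions.
    """
    mapping = {}  # per-file map from .LLn to its unique number

    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = str(next(counter))
        return mapping[name]

    head = tail = None
    for raw in source.splitlines():
        line = raw.split("!", 1)[0].strip()      # drop comments
        if not line:
            continue
        if line.startswith(".") and not line.endswith(":"):
            continue                              # directive, no instruction
        node = Node(LOCAL_LABEL.sub(rename, line))
        if tail is None:
            head = node
        else:
            tail.next, node.prev = node, tail
        tail = node
    return head


def texts(head):
    """Collect the node texts in list order."""
    out = []
    while head is not None:
        out.append(head.text)
        head = head.next
    return out


sample = """
    .align 4
main:
    !#PROLOGUE# 0
    ld [%fp-20], %o0
    ble .LL6
    nop
.LL6:
    st %o0, [%fp-20]
"""
head = parse_assembly(sample, itertools.count(100))
print(texts(head))
```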

Step-2: Expansion

Each assembly file that has been parsed is stored in a separate linked list. In this step
the expander moves through the nodes of the linked list that stores main.s. If a function
call is detected, that function is searched for through all linked lists. When it is found,
that function, from beginning to end, is copied and inserted at the place where it is
called. The expander then continues moving through the nodes from where it stopped.
Expansion continues until the end of main.s is reached. Note that if an inserted function
also calls some other function, the expander expands that as well, until every called
function is inserted in the right place.

In the sample code (Appendix B), the main() function calls the findsum() function twice and
the findsum() function calls the findsub() function. The expanded code (after considering
the individual assembly codes (Appendix C)) is shown in Appendix D.
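
The expansion step can be sketched as a recursive inlining pass. This is a hedged
illustration, not the expander of the invention: plain Python lists stand in for the linked
lists, the instruction strings are hypothetical, the call instruction is assumed to be
replaced by the body of the callee, calls to routines that are not among the parsed files
(for example .umul) are left untouched, and recursion is rejected because it cannot be
expanded to a finite list.

```python
def expand(name, functions, active=()):
    """Return the instructions of `name` with every known call expanded.

    `functions` maps a function name to its list of instruction strings.
    """
    if name in active:
        raise ValueError(f"recursive call to {name} cannot be expanded")
    expanded = []
    for instruction in functions[name]:
        parts = instruction.replace(",", " ").split()
        if parts[0] == "call" and parts[1] in functions:
            # Insert the callee, itself fully expanded, at the call site.
            expanded.extend(expand(parts[1], functions, active + (name,)))
        else:
            expanded.append(instruction)
    return expanded


# Hypothetical stand-ins for main.s, findsum.s and findsub.s.
functions = {
    "main": ["save", "call findsum, 0", "nop", "call findsum, 0", "ret"],
    "findsum": ["add", "call findsub, 0", "retl"],
    "findsub": ["sub", "retl"],
}
print(expand("main", functions))
```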


Step-3: Create Control Flow Linked List

Once the main.s function has been expanded and stored in a doubly linked list, the next
step is to create another doubly linked list (control_flow_linked_list) that stores the
control flow information. This is used to analyze the control flow structure of the
application code and to detect the starting and ending points of functions and control
structures (loops, if-else statements, etc.).

As the expanded linked list is scanned, each node is checked to see whether it belongs to a:

• Label or

• Function or

• Conditional or

• Unconditional branch

In that case, a new node is created and appended to the control flow linked list, with its
member pointers set as defined below.

If the current node is a

• function label

A pointer to the expanded list pointing to the function label node
A pointer to the expanded list pointing to the beginning of the function (the next node
after the function label node)
A pointer to the expanded list pointing to the end of the function
And node type is set to "function".

• label

A pointer to the expanded list pointing to the label node
A pointer to the expanded list pointing to the beginning of the label (the next node after
the label node)
And node type is set to "square".

• unconditional branch (b)

A pointer to the expanded list pointing to the branch node
A pointer to the control flow linked list pointing to the node that stores the matching
target label of the branch instruction
And node type is set to "dot".

• conditional branch (bne, ble, bge, etc.)

A pointer to the expanded list pointing to the branch node
A pointer to the control flow linked list pointing to the node that stores the matching
target label of the branch instruction
And node type is set to "circle".

The control flow linked list output for the findsum.s function is shown in Appendix D.
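
The node classification of this step can be sketched as follows. It is a simplified
illustration with hypothetical names: a Python list of instruction strings stands in for the
expanded linked list, list indices stand in for the pointers into it, and the set of
conditional branch mnemonics is a partial, assumed list rather than the complete UltraSPARC
set.

```python
# Partial, assumed list of conditional branch mnemonics.
CONDITIONAL_BRANCHES = {"bne", "be", "ble", "bge", "bl", "bg"}


def classify(instruction, function_names):
    """Return "function", "square", "dot", "circle", or None."""
    if instruction.endswith(":"):
        label = instruction[:-1]
        return "function" if label in function_names else "square"
    opcode = instruction.split()[0]
    if opcode == "b":
        return "dot"          # unconditional branch
    if opcode in CONDITIONAL_BRANCHES:
        return "circle"       # conditional branch
    return None               # ordinary instruction, not a control flow node


def control_flow_entries(instructions, function_names):
    """List (index, node type, text) for every control flow node."""
    entries = []
    for index, instruction in enumerate(instructions):
        kind = classify(instruction, function_names)
        if kind is not None:
            entries.append((index, kind, instruction))
    return entries


# Hypothetical expanded code in which .LL labels are already renumbered.
expanded = [
    "main:", "ld [%fp-20], %o0", "cmp %o0, 9", "ble 6", "nop",
    "b 4", "nop", "6:", "add %o0, 1, %o1", "4:", "ret",
]
entries = control_flow_entries(expanded, {"main"})
print(entries)
```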
Step 4: Modification of Control Structure

The control structure linked list (which essentially represents the control flow graph of
the candidate algorithm) is then modified as follows.

• The pointers from unconditional branch nodes (also called "dot" nodes) to the next node
in the list need to be disconnected and made NULL. Hence for the "dot" node:
node->next = NULL
and for the following node:
node->previous = NULL
(Exception: if the next node of the "dot" node is itself the target node.)

• The target nodes of the unconditional branches need to be marked as "Possible Exit"
nodes. These "Exit" classes of nodes are a subset of the regular "Target" or "Square"
nodes.

• If an unconditional branch node's rank is higher than its target node's rank (indicating
a feedback or loop), disconnect the link and mark it as NULL. Hence for the "dot" node:
node->to_target = NULL
But before disconnecting, mark target->next (which should be a circle) as a "loop node".

• In a special case, if an unconditional branch and a square share the same node, then
the target of that unconditional branch is declared an exit square with a loop type
(because the instructions following this square comprise the body of the do-while loop).
This exit square will not have its next pointer pointing to a circle. The circle is
accessed through the dot node using the previous pointer, and is then marked as type loop.

• If a "Possible Exit" node has 2 valid input pointers, and the rank of both source
pointers is lower than that of the node under consideration, then it is an "Exit" node;
disconnect the link to the corresponding "dot" node, and hence also set that "dot" node's
target pointer to NULL. In other words, if the node->previous pointer of the
"square/target" node of the "dot" node does not point to the "dot" node, then it has 2
valid pointers. Hence for the "dot" node:
node->to_target = NULL

Sample high level code is given in Figure 1 below, followed by the expanded assembly file.
The control flow linked list is as shown in Figure 2. After modifications to this linked
list, a structure as indicated in Figure 3 is obtained.

#include<stdio.h>
void main()
{
    int i=0,j=0,k=0,l=0,m=0,n=0,p=0,r=0;

    for(i=1;i<10;i++)
    {
        p = p-8;
        p = p*7;
    }
    i=i+1;
    if(i==j)
    {
        n = 9;
        if(k>0)
        {
            p=19;
        }
        else
        {
            r = 23;
        }
        n = 25;
    }
    else
    {
        l=10;
    }
    m=n+r;
    k = k-14;
    k = 7-8*p;
    while(i<p)
    {
        p = p*20;
        p=p-7;
        while(k == 8)
        {
            p=p+17;
            i = i*p;
        }
        p=p-23;
    }
    m = m+5;
    n = n+4;
}



Figure 1: An Example Program 



The gcc (version 2.95.2) compiled code for the UltraSPARC architecture with node 
labeling is as follows: 

        .file   "loop_pattern4.c"
gcc2_compiled.:
        .global .umul
.section        ".text"
        .align 4
        .global main
        .type   main,#function
        .proc   020
main:                                           ground
        !#PROLOGUE# 0
        save    %sp, -144, %sp
        !#PROLOGUE# 1
        st      %g0, [%fp-20]
        st      %g0, [%fp-24]
        st      %g0, [%fp-28]
        st      %g0, [%fp-32]
        st      %g0, [%fp-36]
        st      %g0, [%fp-40]
        st      %g0, [%fp-44]
        st      %g0, [%fp-48]
        mov     1, %o0
        st      %o0, [%fp-20]
.LL3:                                           square 3
        ld      [%fp-20], %o0
        cmp     %o0, 9
        ble     .LL6                            circle 6
        nop
        b       .LL4                            dot 4
        nop
.LL6:                                           square 6
        ld      [%fp-44], %o0
        add     %o0, -8, %o1
        st      %o1, [%fp-44]
        ld      [%fp-44], %o0
        mov     %o0, %o1
        sll     %o1, 3, %o2
        sub     %o2, %o0, %o0
        st      %o0, [%fp-44]
.LL5:                                           square 5
        ld      [%fp-20], %o0
        add     %o0, 1, %o1
        st      %o1, [%fp-20]
        b       .LL3                            dot 3
        nop
.LL4:                                           square 4
        ld      [%fp-20], %o0
        add     %o0, 1, %o1
        st      %o1, [%fp-20]
        ld      [%fp-20], %o0
        ld      [%fp-24], %o1
        cmp     %o0, %o1
        bne     .LL7                            circle 7
        nop
        mov     9, %o0
        st      %o0, [%fp-40]
        ld      [%fp-28], %o0
        cmp     %o0, 0
        ble     .LL8                            circle 8
        nop
        mov     19, %o0
        st      %o0, [%fp-44]
        b       .LL9                            dot 9
        nop
.LL8:                                           square 8
        mov     23, %o0
        st      %o0, [%fp-48]
.LL9:                                           square 9
        mov     25, %o0
        st      %o0, [%fp-40]
        b       .LL10                           dot 10
        nop
.LL7:                                           square 7
        mov     10, %o0
        st      %o0, [%fp-32]
.LL10:                                          square 10
        ld      [%fp-40], %o0
        ld      [%fp-48], %o1
        add     %o0, %o1, %o0
        st      %o0, [%fp-36]
        ld      [%fp-28], %o0
        add     %o0, -14, %o1
        st      %o1, [%fp-28]
        ld      [%fp-44], %o0
        mov     %o0, %o1
        sll     %o1, 3, %o0
        mov     7, %o1
        sub     %o1, %o0, %o0
        st      %o0, [%fp-28]
.LL11:                                          square 11
        ld      [%fp-20], %o0
        ld      [%fp-44], %o1
        cmp     %o0, %o1
        bl      .LL13                           circle 13
        nop
        b       .LL12                           dot 12
        nop
.LL13:                                          square 13
        ld      [%fp-44], %o0
        mov     %o0, %o2
        sll     %o2, 2, %o1
        add     %o1, %o0, %o1
        sll     %o1, 2, %o0
        st      %o0, [%fp-44]
        ld      [%fp-44], %o0
        add     %o0, -7, %o1
        st      %o1, [%fp-44]
.LL14:                                          square 14
        ld      [%fp-28], %o0
        cmp     %o0, 8
        be      .LL16                           circle 16
        nop
        b       .LL15                           dot 15
        nop
.LL16:                                          square 16
        ld      [%fp-44], %o0
        add     %o0, 17, %o1
        st      %o1, [%fp-44]
        ld      [%fp-20], %o0
        ld      [%fp-44], %o1
        call    .umul, 0
        nop
        st      %o0, [%fp-20]
        b       .LL14                           dot 14
        nop
.LL15:                                          square 15
        ld      [%fp-44], %o0
        add     %o0, -23, %o1
        st      %o1, [%fp-44]
        b       .LL11                           dot 11
        nop
.LL12:                                          square 12
        ld      [%fp-36], %o0
        add     %o0, 5, %o1
        st      %o1, [%fp-36]
        ld      [%fp-40], %o0
        add     %o0, 4, %o1
        st      %o1, [%fp-40]
.LL2:                                           square 2
        ret
        restore
.LLfe1:
        .size   main,.LLfe1-main
        .ident  "GCC: (GNU) 2.95.2 19991024 (release)"
Figure 2: Control Flow Linked List

To extract all possibilities of parallelism and reconfiguration, zones are identified in
the modified structure. To identify such sections, delimiters are needed. A delimiter can
be any of the following types of nodes:

(i) Circle

(ii) Dot

(iii) Exit square

(iv) Square

(v) Power

(vi) Ground

A 'Circle' can indicate the start of a new zone or the end of a zone. A 'Dot' can only
indicate the end of a zone or a break in a zone. An 'Exit square' can indicate the start
of a new zone or the end of a zone. A 'Square' can only indicate the continuation of a
break in the current zone. A 'Power' can only indicate the beginning of the first zone.
A 'Ground' can only indicate the end of a zone.

Figure 4 shows example zones to illustrate the use of delimiters. Three zones, 1, 2 and 3,
all share a common node, 'Circle 6'. This node is the end of Zone 1 and the start of Zones
2 and 3. Zone 1 has the 'Power' node as its start, while Zone 6 has the 'Ground' node as
its end. 'Dot 3' in Zone 3 indicates the end of that zone, while 'Dot 4' indicates a break
in Zone 2. This break is continued by 'Square 4'. In Zone 4, 'Square 9' indicates the end
of the zone while it marks the start of Zone 5.
This function identifies zones in the structure in a manner analogous to the numbering
system in the chapter page of a book. Zones can have sibling zones (to identify if/else
conditions, wherein only one of the two possible paths can be taken {Zones 4 and 7 in
Figure 1}) or child zones (to identify nested control structures {Zone 10 being a child of
Zone 8 in Figure 1}). Zone types can be either simple or loopy in nature (to identify
iterative loop structures). The tree is scanned node by node, and decisions are taken to
start a new zone or end an existing zone at key points such as circles, dots and exit
squares. By default, when a circle is visited for the first time, the branch taken path is
followed, but this node, along with the newly started zone, is stored in a queue for a
later visit along the branch not taken path. When the structure has been traversed along
the "branch taken" paths, the nodes with their associated zones are popped from the queue
and traversed along their "branch not taken" paths. This is done until all nodes have been
scanned and the queue is empty.
Figure 4: Zones in the Modified Structure

The pseudo code for the above process is given below:
Global variables: pop_flag = 0, tree_empty = 0;

Zonise (node) /* input into the function is the current node, a starting node */
{
    while (tree_empty == 0) /* this loop goes on node by node in the tree till all
                               have been scanned */
    {
        if (node->type == circle)
        {
            if (pop_flag != set) /* pop_flag is set when a pop operation is done */
            {
                /* an entry here means that the circle was encountered for the
                   first time */
                /* so set the node->visited flag */
                /* close the zone */
                /* since a previously unvisited circle is being entered, the new
                   zone cannot be created as a sibling to the one just closed */
                /* if the zone just closed has a valid Anchor Point, is of type
                   Loop and has its visited flag set, then a child zone cannot be
                   created */
                /* accordingly create a new zone */
                /* set child as current zone */
                /* push this zone and the node into the queue */
                /* take the taken path for the node, i.e. node = node->taken */
            }
            if (pop_flag == set)
            {
                /* an entry here means that a node and its associated zone that
                   have just been popped out from the queue are being visited,
                   hence an old node is being revisited */
                /* since this node has its visited flag set, change that flag
                   value to -1, so as to avoid any erroneous visit in the future */
                /* if node is of type Non Loop, then spawn a new sibling zone */
                /* if node is of type Loop, then spawn new zone as later parent
                   zone and mark zone type as loop */
                /* choose the not taken path for the node */
            }
        }
        else if (node->type == exit square)
        {
            /* close the zone */
            /* if the closed zone has a parent, i.e. zone->parent pointer is not
               NULL, then create a new zone with link to the parent zone as type
               next zone */
            /* if the closed zone does not have a parent, then spawn a new zone
               that is next to the closed zone */
            /* choose the not taken path for the node */
        }
        else if (node->type == dot and node->taken == NULL)
        {
            /* close zone */
            /* choose node to be considered next by popping out from the queue */
            /* in case the queue is empty, all nodes in tree have been scanned */
            /* set pop_flag */
        }
        else if (node->type == dot and node->taken != NULL)
        {
            /* this is just a break in the current zone */
            /* create tempstop1 and tempstart1 pointers */
            /* choose node->taken path */
        }
    } /* end of the first while loop */
}

Once the zones have been identified in the structure, certain relationships can be
observed among them. These form the basis of the extraction of parallelism at the level of
zones. A zone inside a control structure is the 'later child' of the zone outside the
structure. Hence the zone outside a control structure and occurring before (in code
sequence) the zone inside the control structure is a 'former parent' of the zone inside.
The zone outside a control structure and occurring after (in code sequence) the zone
inside the structure is referred to as the 'later parent'. Similarly, the child in this
case is a 'former child'. A zone occurring after another zone and not related to it
through a control structure is the 'next' of the earlier one. After parsing through the
structure, the zonal relationship shown in Figure 5 is obtained.
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S: sibling relationship 
LC: later child relationship 
Lp: later parent relationship 

In all types, destination zone is (Ic/s/lp) of source zone 
The shaded zones are Loop types. 



Figure 5: Initial Zone Structure obtained 

This is referred to as the 'initial zone structure'. The term initial, is used because, 
some links need to be created and some existing ones, need to be removed. This 
process is explained in the section below. 
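The relationships above can be pictured as typed links between zone records. The following is only a minimal sketch under stated assumptions: the names zone, link_type and zone_link are hypothetical and do not appear in the original, and the sketch assumes at most one destination zone per link type.

```c
#include <assert.h>
#include <stddef.h>

/* Link types named in the text; the enum itself is an assumption. */
typedef enum {
    LINK_NEXT,           /* 'n'  */
    LINK_SIBLING,        /* 's'  */
    LINK_LATER_CHILD,    /* 'lc' */
    LINK_LATER_PARENT,   /* 'lp' */
    LINK_FORMER_CHILD,   /* 'fc' */
    LINK_FORMER_PARENT,  /* 'fp' */
    LINK_COUNT
} link_type;

/* Hypothetical zone record: one optional destination per link type. */
typedef struct zone {
    int id;
    int is_loop;                     /* shaded (Loop type) zones of Figure 5 */
    struct zone *link[LINK_COUNT];
} zone;

/* Record that dst is the <type> of src, e.g. dst is the later child of src. */
static void zone_link(zone *src, link_type type, zone *dst)
{
    assert(src != NULL && dst != NULL);
    src->link[type] = dst;
}
```

With such a record, "destination zone is the lc of source zone" becomes a single call such as zone_link(&outer, LINK_LATER_CHILD, &inner).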

Step 6: Further Modification of the 'initial zone structure' 

Some of the relationships discussed in the previous step cannot exist with the
existing set of links, and others are redundant. For example, in Figure 5 we see that:

Z1 can be connected to Z2 through 'n'
Z12 can be connected to Z13 through 'lp'
Z13 can be connected to Z6 through 'n'
Z8 can be connected to Z9 through 'n'
Z4 can be connected to Z5 through 'lp'
Z5 can be connected to Z13 through 'lp'
Z7 can be connected to Z5 through 'lp'

But Z8's relationship to Z6 through 'lp' is false, because no node can have both 'n'
and 'lp' links. In such a case, the 'lp' link should be removed.

Therefore some rules need to be followed to establish 'n' and 'lp' type links if they
do not exist.

To form an 'n' link:

If a zone (1) has an 'lc' link to zone (2), and zone (2) has an 'lp' link to a zone
(3), then an 'n' link can be established between zones (1) and (3). This means that if
zone (1) is of type 'loop', then zone (3) will now be classified as type 'loop' as well.

To form an 'lp' link if it does not exist:

If a zone (1) has an 'fp' link to zone (2), and zone (2) has an 'n' link to a zone
(3), then an 'lp' link can be established between zones (1) and (3).

If a zone (1) has an 'lp' link to zone (2) and also has an 'n' link to zone (3), then
first remove the 'lp' link to zone (2) from zone (1), and then place an 'lp' link from
zone (3) to zone (2).

This provides the 'comprehensive zone structure' shown in Figure 6 (with
cancelled links) and in Figure 7 (with all cancelled links removed).
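Two of the rules above can be sketched in C as follows. This is an illustration only: the zone record and the function names form_n_link and move_lp_link are hypothetical, and the sketch assumes that each zone holds at most one link of each type.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LINK_N, LINK_LC, LINK_LP, LINK_FP, LINK_COUNT } link_type;

typedef struct zone {
    int is_loop;
    struct zone *link[LINK_COUNT];
} zone;

/* 'n' rule: z1 -lc-> z2 and z2 -lp-> z3 gives z1 -n-> z3.
   Returns 1 if a new 'n' link was created, 0 otherwise. */
static int form_n_link(zone *z1)
{
    zone *z2, *z3;
    assert(z1 != NULL);
    z2 = z1->link[LINK_LC];
    if (z2 == NULL || z2->link[LINK_LP] == NULL || z1->link[LINK_N] != NULL)
        return 0;
    z3 = z2->link[LINK_LP];
    z1->link[LINK_N] = z3;
    if (z1->is_loop)
        z3->is_loop = 1;        /* loop type carries over to zone (3) */
    return 1;
}

/* A zone may not keep both 'lp' and 'n': move its 'lp' link to its 'n' successor.
   Returns 1 if the link was moved, 0 if the zone did not hold both links. */
static int move_lp_link(zone *z1)
{
    zone *z2, *z3;
    assert(z1 != NULL);
    z2 = z1->link[LINK_LP];
    z3 = z1->link[LINK_N];
    if (z2 == NULL || z3 == NULL)
        return 0;
    z1->link[LINK_LP] = NULL;   /* remove 'lp' from zone (1) to zone (2) */
    z3->link[LINK_LP] = z2;     /* place 'lp' from zone (3) to zone (2)  */
    return 1;
}
```

Applying such functions repeatedly over all zones until none of them reports a change would be one way to reach the comprehensive structure; the text does not state the order in which the rules are applied.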




Figure 6: Comprehensive Zone structure with cancelled links shown 




Figure 7: Comprehensive Zone structure with cancelled links removed

To identify parallelism, and hence the compulsorily sequential paths of execution, the
following approach is adopted. First, the comprehensive zone structure is
ordered sequentially by starting at the first zone and traversing along an 'lc-lp' path.
If a sibling link is encountered, it is given a parallel path. The resulting structure is
shown in Figure 8.

Figure 8: Sequentially Ordered Zones

To establish parallelism between a zone (1) of loop count A and its upper zone (2) of
loop count B, where A < B, check for data dependency between zone (1) and all zones
above it, up to and including the zone with the same loop count as zone (2).
In the example above, to establish parallelism between zone 6 and zone 9, check for
dependencies between zone 6 and zones 9, 10 and 8. If there is no dependency, then
zone 6 is parallel to zone 8.

To establish parallelism between a zone (1) of loop count A and its upper zone (2) of
loop count B, where A = B, a direct dependency check needs to be performed.

To establish parallelism between a zone (1) of loop count A and its upper zone (2) of
loop count B, where A > B, a direct dependency check needs to be performed. Zone (1)
will then have an iteration count of (its own iteration count *
zone (2)'s iteration count).
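The three cases can be sketched as below. This is one possible reading of the text rather than the original implementation: the zone record, the function names and the loop count values used to exercise it are all hypothetical. The sketch assumes the zones are held in an array ordered from the top zone (index 0) downward, and that for A < B the search stops at the nearest zone above zone (2) whose loop count equals B.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int loop_count;   /* the loop count A or B of the text */
    int iterations;   /* iteration count of the zone       */
} zone;

/* chain[i] is zone (1) and chain[i - 1] is its upper zone (2).
   Returns the index of the topmost zone that must be dependency-checked;
   every zone from that index down to i - 1 is checked against zone (1). */
static int first_zone_to_check(const zone *chain, int i)
{
    int a, b, j;
    assert(chain != NULL && i >= 1);
    a = chain[i].loop_count;
    b = chain[i - 1].loop_count;
    if (a >= b)
        return i - 1;                /* A = B or A > B: direct check only */
    for (j = i - 2; j >= 0; j--)     /* A < B: extend up to the matching zone */
        if (chain[j].loop_count == b)
            return j;
    return i - 1;                    /* assumption: no match, fall back to direct check */
}

/* A > B with no dependency: zone (1) is repeated once per iteration of zone (2). */
static void merge_iterations(zone *z1, const zone *z2)
{
    assert(z1 != NULL && z2 != NULL);
    z1->iterations *= z2->iterations;
}
```

For a chain holding zones 8, 10, 9 and 6 from top to bottom, with zones 8 and 9 sharing a loop count, the call for zone 6 returns the index of zone 8, which matches the "9, 10, 8" example in the text.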
When a zone rises like a bubble, is parallel with another zone in the primary path,
and reaches a dependency, it is placed in a secondary path. No bubble in the
secondary path is subjected to dependency testing.

After a bubble has reached its highest potential and stays put in a place in the
secondary path, the lowest bubble in the primary path is checked for dependency on
its upper fellow.

If the upper bubble happens to have a different loop count number, then testing is
carried out as described earlier. If parallelism cannot be obtained, this
bubble is clubbed with the set of bubbles ranging from its upper fellow up to and
including the bubble further up the chain with the same loop count as its upper fellow. A
global I/O parameter set is created for this new coalition. The coalition will then
attempt to find dependencies with its upper fellow.

The loop count for this coalition will be the bounding zone's loop count. Any increase in
the iteration count of this coalition will be reflected in all zones inside it. If a bubble
wants to rise above another one that has a sibling/reverse sibling link, there will be
speculative parallelism.

The algorithm should start at multiple points, one by one. These points can be
obtained by starting from the top zone and traversing down until a sibling split is
reached. This zone should then be remembered, and one of the paths taken. This
procedure is similar to the stack saving scheme used earlier in the zonise function.

Another pre-processing step loop-unrolls every iterative segment of a
CDFG that does not have conditional branch instructions inside it and whose iteration
count is known at compile time.
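As an illustration of this unrolling step, the loop in findsum of Appendix B qualifies: its body has no conditional branch and its iteration count (10) is known at compile time. The sketch below shows the rolled and fully unrolled forms side by side; the function names bump_rolled and bump_unrolled are hypothetical.

```c
#include <assert.h>

/* Rolled form, as written in findsum: a loop test and a branch on every iteration. */
static int bump_rolled(int k)
{
    int i;
    for (i = 0; i < 10; i++)
        k = k + 1;
    return k;
}

/* Fully unrolled form: ten copies of the body, no loop test and no branch,
   so the segment becomes straight-line code. */
static int bump_unrolled(int k)
{
    k = k + 1; k = k + 1; k = k + 1; k = k + 1; k = k + 1;
    k = k + 1; k = k + 1; k = k + 1; k = k + 1; k = k + 1;
    return k;
}
```

Both forms compute the same value for any starting k that does not overflow; only the control flow differs.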



Appendix B

#include <stdio.h>

/* Prototypes added so that the calls below are declared before use */
int findsum(int a, int b);
int findsub(int x, int y);

void main()
{
    int i, j, k, l;

    i = 10;
    j = 1 * 4;
    if (j > 5)
    {
        k = findsum(i, j);
        l = 4 + k;
    }
    else
    {
        k = findsum(i, j);
        l = k * 10;
    }
}

int findsum(int a, int b)
{
    int i, j, k;
    k = 4;
    for (i = 0; i < 10; i++)
        k = k + 1;
    j = findsub(k, a);
    return j;
}

int findsub(int x, int y)
{
    int t;
    t = x - y;
    return (t);
}



Appendix C 



Main.s

    .file "main.c"
gcc2_compiled.:
.section ".text"
    .align 4
    .global main
    .type main, #function
    .proc 020
main:
    !#PROLOGUE# 0
    save %sp, -128, %sp
    !#PROLOGUE# 1
    mov 10, %o0
    st %o0, [%fp-20]
    mov 4, %o0
    st %o0, [%fp-24]
    ld [%fp-24], %o0
    cmp %o0, 5
    ble .LL3
    nop
    ld [%fp-20], %o0
    ld [%fp-24], %o1
    call findsum, 0
    nop
    st %o0, [%fp-28]
    ld [%fp-28], %o0
    add %o0, 4, %o1
    st %o1, [%fp-32]
    b .LL4
    nop
.LL3:
    ld [%fp-20], %o0
    ld [%fp-24], %o1
    call findsum, 0
    nop
    st %o0, [%fp-28]
    ld [%fp-28], %o0
    mov %o0, %o2
    sll %o2, 2, %o1
    add %o1, %o0, %o1
    sll %o1, 1, %o0
    st %o0, [%fp-32]
.LL4:
.LL2:
    ret
    restore
.LLfe1:
    .size main, .LLfe1-main
    .ident "GCC: (GNU) 2.95.2 19991024 (release)"
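In the .LL3 branch of Main.s the statement l = k*10 from Appendix B is not compiled to a multiply instruction; the compiler emits a shift, an add and another shift. The sketch below mirrors that sequence in C to show why it yields 10k. The function name times10_shift_add and the range check are assumptions made for the illustration.

```c
#include <assert.h>

/* Mirrors the sll/add/sll sequence of Main.s: ((k << 2) + k) << 1 == 10 * k. */
static int times10_shift_add(int k)
{
    int t;
    assert(k >= 0 && k <= 1000000);  /* keep the signed shifts well defined */
    t = k << 2;                      /* sll %o2, 2, %o1   : t = 4k  */
    t = t + k;                       /* add %o1, %o0, %o1 : t = 5k  */
    return t << 1;                   /* sll %o1, 1, %o0   : 10k     */
}
```

This matters for the later analysis because the expanded listing of Appendix D carries the three simple instructions, not a single multiply.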



Findsum.s

    .file "findsum.c"
gcc2_compiled.:
.section ".text"
    .align 4
    .global findsum
    .type findsum, #function
    .proc 04
findsum:
    !#PROLOGUE# 0
    save %sp, -128, %sp
    !#PROLOGUE# 1
    st %i0, [%fp+68]
    st %i1, [%fp+72]
    mov 4, %o0
    st %o0, [%fp-28]
    st %g0, [%fp-20]
.LL3:
    ld [%fp-20], %o0
    cmp %o0, 9
    ble .LL6
    nop
    b .LL4
    nop
.LL6:
    ld [%fp-28], %o0
    add %o0, 1, %o1
    st %o1, [%fp-28]
.LL5:
    ld [%fp-20], %o0
    add %o0, 1, %o1
    st %o1, [%fp-20]
    b .LL3
    nop
.LL4:
    ld [%fp-28], %o0
    ld [%fp+68], %o1
    call findsub, 0
    nop
    st %o0, [%fp-24]
    ld [%fp-24], %o0
    mov %o0, %i0
    b .LL2
    nop
.LL2:
    ret
    restore
.LLfe1:
    .size findsum, .LLfe1-findsum
    .ident "GCC: (GNU) 2.95.2 19991024 (release)"



Findsub.s

    .file "findsub.c"
gcc2_compiled.:
.section ".text"
    .align 4
    .global findsub
    .type findsub, #function
    .proc 04
findsub:
    !#PROLOGUE# 0
    save %sp, -120, %sp
    !#PROLOGUE# 1
    st %i0, [%fp+68]
    st %i1, [%fp+72]
    ld [%fp+68], %o0
    ld [%fp+72], %o1
    sub %o0, %o1, %o0
    st %o0, [%fp-20]
    ld [%fp-20], %o0
    mov %o0, %i0
    b .LL2
    nop
.LL2:
    ret
    restore
.LLfe1:
    .size findsub, .LLfe1-findsub
    .ident "GCC: (GNU) 2.95.2 19991024 (release)"



Appendix D 



Expanded main function

Function main BEGINS here
save %sp -128 %sp
mov 10 %o0
st %o0 [%fp-20]
mov 4 %o0
st %o0 [%fp-24]
ld [%fp-24] %o0
cmp %o0 5
ble 0
nop
ld [%fp-20] %o0
ld [%fp-24] %o1
Function findsum BEGINS here
save %sp -128 %sp
st %i0 [%fp+68]
st %i1 [%fp+72]
mov 4 %o0
st %o0 [%fp-28]
st %g0 [%fp-20]
4
ld [%fp-20] %o0
cmp %o0 9
ble 5
nop
b 6
nop
5
ld [%fp-28] %o0
add %o0 1 %o1
st %o1 [%fp-28]
7
ld [%fp-20] %o0
add %o0 1 %o1
st %o1 [%fp-20]
b 4
nop
6
ld [%fp-28] %o0
ld [%fp+68] %o1
Function findsub BEGINS here
save %sp -120 %sp
st %i0 [%fp+68]
st %i1 [%fp+72]
ld [%fp+68] %o0
ld [%fp+72] %o1
sb %o0 %o1 %o0
st %o0 [%fp-20]
ld [%fp-20] %o0
mov %o0 %i0
b 10
nop
10
ret
restore
11
Function findsub ENDS here
findsub .LLfe1-findsub
nop
st %o0 [%fp-24]
ld [%fp-24] %o0
mov %o0 %i0
b 8
nop
8
ret
restore
9
Function findsum ENDS here
findsum .LLfe1-findsum
nop
st %o0 [%fp-28]
ld [%fp-28] %o0
add %o0 4 %o1
st %o1 [%fp-32]
b 1
nop
0
ld [%fp-20] %o0
ld [%fp-24] %o1
Function findsum BEGINS here
save %sp -128 %sp
st %i0 [%fp+68]
st %i1 [%fp+72]
mov 4 %o0
st %o0 [%fp-28]
st %g0 [%fp-20]
4
ld [%fp-20] %o0
cmp %o0 9
ble 5
nop
b 6
nop
5
ld [%fp-28] %o0
add %o0 1 %o1
st %o1 [%fp-28]
7
ld [%fp-20] %o0
add %o0 1 %o1
st %o1 [%fp-20]
b 4
nop
6
ld [%fp-28] %o0
ld [%fp+68] %o1
Function findsub BEGINS here
save %sp -120 %sp
st %i0 [%fp+68]
st %i1 [%fp+72]
ld [%fp+68] %o0
ld [%fp+72] %o1
sb %o0 %o1 %o0
st %o0 [%fp-20]
ld [%fp-20] %o0
mov %o0 %i0
b 10
nop
10
ret
restore
11
Function findsub ENDS here
findsub .LLfe1-findsub
nop
st %o0 [%fp-24]
ld [%fp-24] %o0
mov %o0 %i0
b 8
nop
8
ret
restore
9
Function findsum ENDS here
findsum .LLfe1-findsum
nop
st %o0 [%fp-28]
ld [%fp-28] %o0
mov %o0 %o2
sll %o2 2 %o1
add %o1 %o0 %o1
sll %o1 1 %o0
st %o0 [%fp-32]
1
2
ret
restore
3
Function main ENDS here



Appendix E 



Control flow linked list

[Figure: a chain of control flow nodes, each with the fields to_main, to_target, begins and ends.]

Main linked list

[Figure: a chain of instruction nodes holding instructions of the expanded listing, for example st %g0 [%fp-20], cmp %o0 9, ble 5 and the body of function findsub.]

Appendix F

In this section the pseudo ANSI C codes for the test-bench algorithms are presented.

Note: For an in-depth analysis and explanation of all graphics algorithms, please refer to the
book "Computer Graphics: Principles and Practice", Second Edition in C, by Foley, van
Dam, Feiner and Hughes.

Cohen Sutherland Line Clipping

typedef unsigned int outcode;
typedef enum {FALSE, TRUE} boolean;
enum {TOP = 0x1, BOTTOM = 0x2, RIGHT = 0x4, LEFT = 0x8};

/* Prototypes for the helpers defined below */
outcode CompOutCode (double x, double y, double xmin, double xmax,
                     double ymin, double ymax);
void MidpointLineReal (double x0, double y0, double x1, double y1, double value);

void CohenSutherlandLineClipAndDraw (
    double x0, double y0, double x1, double y1, double xmin, double xmax,
    double ymin, double ymax, int value)
/* Cohen-Sutherland clipping algorithm for line P0 = (x0,y0) to P1 = (x1,y1) and */
/* clip rectangle with diagonal from (xmin,ymin) to (xmax,ymax) */
{
    /* Outcodes for P0, P1 and whatever point lies outside the clip rectangle */
    outcode outcode0, outcode1, outcodeOut;
    boolean accept = FALSE, done = FALSE;

    outcode0 = CompOutCode (x0, y0, xmin, xmax, ymin, ymax);
    outcode1 = CompOutCode (x1, y1, xmin, xmax, ymin, ymax);
    do {
        if (!(outcode0 | outcode1)) {       /* both endpoints inside: accept */
            accept = TRUE; done = TRUE;
        } else if (outcode0 & outcode1)     /* both outside on one side: reject */
            done = TRUE;
        else {
            double x, y;

            /* Pick an endpoint that lies outside the clip rectangle */
            outcodeOut = outcode0 ? outcode0 : outcode1;
            if (outcodeOut & TOP) {
                x = x0 + (x1 - x0) * (ymax - y0) / (y1 - y0);
                y = ymax;
            } else if (outcodeOut & BOTTOM) {
                x = x0 + (x1 - x0) * (ymin - y0) / (y1 - y0);
                y = ymin;
            } else if (outcodeOut & RIGHT) {
                y = y0 + (y1 - y0) * (xmax - x0) / (x1 - x0);
                x = xmax;
            } else {
                y = y0 + (y1 - y0) * (xmin - x0) / (x1 - x0);
                x = xmin;
            }

            /* Move the outside endpoint to the intersection and recompute its outcode */
            if (outcodeOut == outcode0) {
                x0 = x; y0 = y;
                outcode0 = CompOutCode (x0, y0, xmin, xmax, ymin, ymax);
            } else {
                x1 = x; y1 = y;
                outcode1 = CompOutCode (x1, y1, xmin, xmax, ymin, ymax);
            }
        }
    } while (done == FALSE);
    if (accept)
        MidpointLineReal (x0, y0, x1, y1, value);
}

outcode CompOutCode (
    double x, double y, double xmin, double xmax, double ymin, double ymax)
{
    outcode code = 0;
    if (y > ymax)
        code |= TOP;
    else if (y < ymin)
        code |= BOTTOM;
    if (x > xmax)
        code |= RIGHT;
    else if (x < xmin)
        code |= LEFT;
    return code;
}

void MidpointLineReal (double x0, double y0, double x1, double y1, double value)
{
    double dx = x1 - x0;
    double dy = y1 - y0;
    double d = 2*dy - dx;
    double incrE = 2*dy;
    double incrNE = 2*(dy - dx);
    double x = x0;
    double y = y0;
    WritePixel (x, y, value);

    while (x < x1) {
        if (d <= 0) {
            d += incrE;
            x++;
        } else {
            d += incrNE;
            x++;
            y++;
        }
        WritePixel (x, y, value);
    }
}

Mid-point Ellipse Scan Conversion

void MidpointEllipse (int a, int b, int value)
/* Assumes center of ellipse is at the origin. Note that overflow may occur */
/* for 16-bit integers because of the squares */
{
    double d2;
    int x = 0;
    int y = b;
    double d1 = b*b - (a*a*b) + (0.25*a*a);

    EllipsePoints (x, y, value);   /* The 4-way symmetrical WritePixel */

    /* First region */
    while (a*a*(y - 0.5) > b*b*(x + 1)) {
        if (d1 < 0)
            d1 += b*b*(2*x + 3);
        else {
            d1 += b*b*(2*x + 3) + a*a*(-2*y + 2);
            y--;
        }
        x++;
        EllipsePoints (x, y, value);
    }

    /* Second region */
    d2 = b*b*(x + 0.5)*(x + 0.5) + a*a*(y - 1)*(y - 1) - a*a*b*b;
    while (y > 0) {
        if (d2 < 0) {
            d2 += b*b*(2*x + 2) + a*a*(-2*y + 3);
            x++;
        } else
            d2 += a*a*(-2*y + 3);
        y--;
        EllipsePoints (x, y, value);
    }
}

The bitBlock Transfer Algorithm

typedef struct {
    point topLeft, bottomRight;
} rectangle;

typedef struct {
    char *base;
    int width;
    rectangle rect;
} bitmap;

typedef struct {
    unsigned int bits:32;
} texture;

typedef struct {
    char *wordptr;
    int bit;
} bitPointer;

void bitBlt (
    bitmap map1,
    point point1,
    texture tex,
    bitmap map2,
    rectangle rect2,
    writeMode mode)
{
    int width;
    int height;
    bitPointer p1, p2;

    clip x_values;
    clip y_values;

    width = rect2.bottomRight.x - rect2.topLeft.x;
    height = rect2.bottomRight.y - rect2.topLeft.y;

    if (width < 0 || height < 0)
        return;

    p1.wordptr = map1.base;
    p1.bit = map1.rect.topLeft.x % 32;
    /* And the first bit in the bitmap is a few bits further in */
    /* Increment p1 until it points to the specified point in the first bitmap */
    IncrementPointer (p1, point1.x - map1.rect.topLeft.x + map1.width *
                      (point1.y - map1.rect.topLeft.y));

    /* Same for p2 - it points to the origin of the destination rectangle */
    p2.wordptr = map2.base;
    p2.bit = map2.rect.topLeft.x % 32;
    IncrementPointer (p2, rect2.topLeft.x - map2.rect.topLeft.x +
                      map2.width * (rect2.topLeft.y - map2.rect.topLeft.y));

    if (p1 < p2) {
        /* The pointer p1 comes before p2 in memory; if they are in the same bitmap */
        /* the origin of the source rectangle is either above the origin for the */
        /* destination or, if at the same level, to the left of it */
        IncrementPointer (p1, height * map1.width + width);
        /* Now p1 points to the lower right word of the rectangle */
        IncrementPointer (p2, height * map2.width + width);
        /* Same for p2, but the destination rectangle */
        point1.x += width;
        point1.y += height;
        /* This point is now just beyond the lower right in the rectangle */
        while (height-- > 0) {
            /* Copy rows from the source to the target bottom to top, right to left */
            DecrementPointer (p1, map1.width);
            DecrementPointer (p2, map2.width);
            temp_y = point1.y % 32;   /* used to index into texture */
            temp_x = point1.x % 32;
            /* Now do the real bitBlt from bottom right to top left */
            RowBltNegative (p1, p2, width, BitRotate (tex[temp_y], temp_x), mode);
        } /* while */
    } else { /* if p1 >= p2 */
        while (height-- > 0) {
            /* Copy rows from source to destination, top to bottom, left to right */
            /* Do the real bitBlt, from top left to bottom right */
            RowBltPositive (same arguments as before);
            increment pointers;
        } /* while */
    } /* else */
} /* bitBlt */

void ClipValues (bitmap *map1, bitmap *map2, point *point1, rectangle *rect2)
{
    if (*point1 not inside *map1) {
        adjust *point1 to be inside *map1;
        adjust origin of *rect2 by the same amount;
    }
    if (origin of *rect2 not inside *map2) {
        adjust origin of *rect2 to be inside *map2;
        adjust *point1 by the same amount;
    }
    if (opposite corner of *rect2 not inside *map2)
        adjust opposite corner of *rect2 to be inside;
    if (opposite corner of corresponding rectangle in *map1 not inside *map1)
        adjust opposite corner of rectangle;
} /* ClipValues */

void RowBltPositive (
    bitPtr p1, bitPtr p2,   /* Source and destination pointers */
    int n,                  /* How many bits to copy */
    char tword,             /* Texture word */
    writeMode mode)         /* Mode to blt pixels */
/* Copy n bits from position p1 to position p2 according to the mode */
{
    while (n-- > 0) {
        if (BitIsSet (tword, 32))     /* If texture says it is OK to copy.. */
            MoveBit (p1, p2, mode);   /* then copy the bit */
        IncrementPointer (p1);
        IncrementPointer (p2);
        RotateLeft (tword);           /* Rotate bits in tword to the left */
    } /* while */
} /* RowBltPositive */



Phong Shading

/* Headers assumed for a Borland-style BGI environment (initgraph, putpixel, delay) */
#include <stdio.h>
#include <math.h>
#include <graphics.h>
#include <dos.h>

double db1=2.5, db2=65535., pi;
int colors[]={3,6,10,13,6,3,10,13,6,3,13,10},
    d[]={640,350,1},
    i, k,
    palette[]={000,010,001,011,020,002,022,077,
               040,004,044,060,006,066,007,077},
    x, y, x_min, x_max, y_min, y_max;
int min, sec;
unsigned short random;

main()
{
    double a,b,c,l0,l1,l2,ln,ln1,n0,n1,n2,p,q,r=128,s,t,v[12][3];
    int n;

    int graphdriver = DETECT, graphmode;
    int color;

    initgraph(&graphdriver, &graphmode, "");
    /* for (n=0;n<16;n++) */

#ifdef Intel
    printf("\n\t\t 80387 Phong Shading Demonstration Program\n");
#else
    printf("\n\t\t\t Phong Shading Demonstration\n");
#endif
    /* printf("\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n");
       start=clock(); */

    /* Pixel aspect ratio. Original value is 1.3 which works with EGA */
    /* This is hence the version for my - ThL - EIZO VGA Card */
    a=1.0;
    /* Screen center coordinates */
    b=0.5*(d[0]-1);   /* x-position */
    c=0.5*(d[1]-1);   /* y-position */
    /* Unit length light source vector */
    l0=-1/sqrt(3.);
    l1=l0;
    l2=-l0;
    /* Ratio circumference to diameter of a circle */
    pi=4*atan(1.);
    /* A dozen vertices evenly spread over a unit sphere */
    v[0][0]=0;
    v[0][1]=0;
    v[0][2]=1;
    s=sqrt(5.);
    for (i=1;i<11;i++) {
        p=pi*i/5;
        v[i][0]=2*cos(p)/s;
        v[i][1]=2*sin(p)/s;
        v[i][2]=(1.-i%2*2)/s;
    }
    v[11][0]=0;
    v[11][1]=0;
    v[11][2]=-1;

    /* Loop to Phong shade each pixel */
    y_max=c+r;
    y_min=2*c-y_max;
    for (y=y_min;y<=y_max;y++) {
        s=y-c;
        n1=s/r;
        ln1=l1*n1;
        s=r*r-s*s;
        x_max=b+a*sqrt(s);
        x_min=2*b-x_max;
        for (x=x_min;x<=x_max;x++) {
            t=(x-b)/a;
            n0=t/r;
            t=sqrt(s-t*t);
            n2=t/r;
            /* Compute dot product and clamp to positive value */
            ln=l0*n0+ln1+l2*n2;
            if (ln<0) ln=0;
            /* cos(e.r)**27 */
            t=ln*n2;
            t+=t-l2;
            t*=t*t;
            t*=t*t;
            t*=t*t;
            /* Nearest vertex to normal yields max dot product */
            /* Get its color */
            for (i=0,p=0;i<11;i++)
                if (p<(q=n0*v[i][0]+n1*v[i][1]+n2*v[i][2])) {
                    p=q;
                    k=colors[i];
                }/*end for*/
            /* Aggregate ambient, diffuse, and specular intensities
               do dither */
            random=37*random+1;
            i=k-db1+db1*ln+t+random/db2;
            /* Clamp values outside range of three color level to black or white */
            if (i < (k-2)) i=0;
            else
                if (i > k) i=15;
            putpixel(x,y,i);
        }/*end for*/
    }/*end for*/

exit:
    delay(5000);
    closegraph();

}/*end main*/
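The comment "cos(e.r)**27" in the listing above is realized without calling a power function: the three statements t*=t*t each cube the running value, and three successive cubings raise it to the power 3*3*3 = 27. The sketch below isolates that step; the function name pow27 is hypothetical.

```c
#include <assert.h>

/* Raises t to the 27th power with six multiplications by cubing three times. */
static double pow27(double t)
{
    assert(t == t);   /* reject NaN input */
    t *= t * t;       /* t^3  */
    t *= t * t;       /* t^9  */
    t *= t * t;       /* t^27 */
    return t;
}
```

A fixed, short chain of multiplications such as this is straight-line code with no branches, which is the kind of segment the zone analysis described earlier can treat as a single data flow block.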

Appendix G

Algorithm:

Task schedule (G(V,E), CTRL_VARS[N], PE = {PE1, PE2, ..., PEM})

For each combination of CTRL_VARS do

Generate a DFG Gsub(V, E, CTRL_VARS[I]) which is a sub-graph of G(V,E). Only the
nodes and edges in the control flow corresponding to the current combination of
CTRL_VARS are included in this sub-graph.

Generate the PCP schedule of Gsub[I]. Let the schedule be PCP_sched[I] and the delay be
PCP_delay[I].

Sort PCP_sched, PCP_delay and Gsub in decreasing order of PCP_delay[I].

Generate the Branch and bound schedule for Gsub[0], the sub-graph with the worst
PCP_delay. Let the schedule be BB_sched[I=0] and the delay be BB_delay[I=0].
Initialize worst_bb_delay = BB_delay[0]

For all the other sub-graphs do
{
    if (PCP_delay[I] < worst_bb_delay) then
        BB_sched[I] = PCP_sched[I];
        BB_delay[I] = PCP_delay[I];
    else
        Generate BB_sched[I] and BB_delay[I];
        if (BB_delay[I] > worst_bb_delay) then
            worst_bb_delay = BB_delay[I];
}

Generate the branching tree with the help of G(V,E). In this branching tree, an edge
represents a choice (K and K') and a node represents the variable (K).
Initialize the current path to the one leading from the top to the leaf in such a way that the
DFG corresponding to this path gives the worst_bb_delay. The path is nothing but a list
of edges tracing from the top node to the leaf.